Abstract

We consider bandit problems involving a large (possibly infinite) collection of arms, in which
the expected reward of each arm is a linear function of an $r$-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$,
where $r \ge 2$. The objective is to minimize the cumulative regret and Bayes risk. When the set of
arms corresponds to the unit sphere, we prove that the regret and Bayes risk are of order $\Theta(r\sqrt{T})$,
by establishing a lower bound for an arbitrary policy, and showing that a matching upper bound is
obtained through a policy that alternates between exploration and exploitation phases. The phase-based
policy is also shown to be effective if the set of arms satisfies a strong convexity condition.
For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes
risk admit upper bounds of the form $O(r\sqrt{T}\log^{3/2} T)$.

1. Introduction 



Since its introduction by Thompson (1933), the multiarmed bandit problem has served as an
important model for decision making under uncertainty. Given a set of arms with unknown reward 
profiles, the decision maker must choose a sequence of arms to maximize the expected total payoff, 
where the decision in each period may depend on the previously observed rewards. The multiarmed 
bandit problem elegantly captures the tradeoff between the need to exploit arms with high payoff 
and the incentive to explore previously untried arms for information gathering purposes. 

Much of the previous work on the multiarmed bandit problem assumes that the rewards of the
arms are statistically independent (see, for example, Lai and Robbins (1985) and Lai (1987)). This
assumption enables us to consider each arm separately, but it leads to policies whose regret scales
linearly with the number of arms. Most policies that assume independence require each arm to be
tried at least once, and are impractical in settings involving many arms. In such settings, we want
a policy whose regret is independent of the number of arms.

When the mean rewards of the arms are assumed to be independent random variables,
Lai and Robbins (1985) show that the regret under an arbitrary policy must increase linearly with the number of
arms. However, the assumption of independence is quite strong in practice. In many applications,
the information obtained from pulling one arm can change our understanding of other arms. For
instance, in marketing applications, we expect a priori that similar products should have similar
sales. By exploiting the correlation among products/arms, we should be able to obtain a policy
whose regret scales more favorably than traditional bandit algorithms that ignore correlation and
assume independence.



Mersereau et al. (2009) propose a simple model that demonstrates the benefits of exploiting
the underlying structure of the rewards. They consider a bandit problem where the expected
reward of each arm is a linear function of an unknown scalar, with a known prior distribution.
Since the reward of each arm depends on a single random variable, the mean rewards are perfectly
correlated. They prove that, under certain assumptions, the cumulative Bayes risk over $T$ periods
(defined below) under a greedy policy admits an $O(\log T)$ upper bound, independent of the number
of arms.



In this paper, we extend the model of Mersereau et al. (2009) to the setting where the expected
reward of each arm depends linearly on a multivariate random vector $\mathbf{Z} \in \mathbb{R}^r$. We concentrate on
the case where $r \ge 2$, which is fundamentally different from the previous model because the mean
rewards now depend on more than one random variable, and thus, they are no longer perfectly
correlated. The bounds on the regret and Bayes risk and the policies found in Mersereau et al.
(2009) no longer apply. To give a flavor for the differences, we will show that, in our model, the
cumulative Bayes risk under an arbitrary policy is at least $\Omega(r\sqrt{T})$, which is significantly higher
than the upper bound of $O(\log T)$ attainable when $r = 1$.

The linearly parameterized bandit is an important model that has been studied by many
researchers, including Ginebra and Clayton (1995), Abe and Long (1999), and Auer (2002). The
results in this paper complement and extend the earlier and independent work of Dani et al. (2008a)
in a number of directions. We provide a detailed comparison between our work and the existing
literature in Sections 1.3 and 1.4.
1.1 The Model

We have a compact set $\mathcal{U}_r \subset \mathbb{R}^r$ that corresponds to the set of arms, where $r \ge 2$. The reward $X_t^{\mathbf{u}}$
of playing arm $\mathbf{u} \in \mathcal{U}_r$ in period $t$ is given by

$$X_t^{\mathbf{u}} = \mathbf{u}'\mathbf{Z} + W_t^{\mathbf{u}}, \qquad (1)$$

where $\mathbf{u}'\mathbf{Z}$ is the inner product between the vector $\mathbf{u} \in \mathbb{R}^r$ and the random vector $\mathbf{Z} \in \mathbb{R}^r$. We
assume that the random variables $W_t^{\mathbf{u}}$ are independent of each other and of $\mathbf{Z}$. Moreover, for
each $\mathbf{u} \in \mathcal{U}_r$, the random variables $\{W_t^{\mathbf{u}} : t \ge 1\}$ are identically distributed, with $E[W_t^{\mathbf{u}}] = 0$ for
all $t$ and $\mathbf{u}$. We allow the error random variables $W_t^{\mathbf{u}}$ to have unbounded support, provided that
their moment generating functions satisfy certain conditions (given in Assumption 1). Each vector
$\mathbf{u} \in \mathcal{U}_r$ simultaneously represents an arm and determines the expected reward of that arm. So,
when it is clear from the context, we will interchangeably refer to a $\mathbf{u} \in \mathcal{U}_r$ as either a vector or an
arm.

Let us introduce the following conventions and notation that will be used throughout the paper.
We denote vectors and matrices in bold. All vectors are column vectors. For any vector $\mathbf{v} \in \mathbb{R}^r$,
its transpose is denoted by $\mathbf{v}'$, and is always a row vector. Let $\mathbf{0}$ denote the zero vector, and
for $k = 1, \ldots, r$, let $\mathbf{e}_k = (0, \ldots, 1, \ldots, 0)$ denote the standard unit vector in $\mathbb{R}^r$, with a 1 in the
$k$th component and a 0 elsewhere. Also, let $\mathbf{I}_k$ denote the $k \times k$ identity matrix. We let $\mathbf{A}'$ and
$\det(\mathbf{A})$ denote the transpose and determinant of $\mathbf{A}$, respectively. If $\mathbf{A}$ is a symmetric positive
semidefinite matrix, then $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ denote the smallest and the largest eigenvalues of
$\mathbf{A}$, respectively. We use $\mathbf{A}^{1/2}$ to denote its symmetric nonnegative definite square root, so that
$\mathbf{A}^{1/2}\mathbf{A}^{1/2} = \mathbf{A}$. If $\mathbf{A}$ is also positive definite, we let $\mathbf{A}^{-1/2} = (\mathbf{A}^{-1})^{1/2}$. For any vector $\mathbf{v}$, $\|\mathbf{v}\| =
\sqrt{\mathbf{v}'\mathbf{v}}$ denotes the standard Euclidean norm, and for any positive definite matrix $\mathbf{A}$, $\|\mathbf{v}\|_{\mathbf{A}} = \sqrt{\mathbf{v}'\mathbf{A}\mathbf{v}}$
denotes a corresponding weighted norm. For any two symmetric positive semidefinite matrices $\mathbf{A}$
and $\mathbf{B}$, we write $\mathbf{A} \le \mathbf{B}$ if the matrix $\mathbf{B} - \mathbf{A}$ is positive semidefinite. Also, all logarithms $\log(\cdot)$
in the paper denote the natural log, with base $e$. A random variable is denoted by an uppercase
letter while its realized values are denoted in lowercase.

For any $t \ge 1$, let $H_{t-1}$ denote the set of possible histories until the end of period $t - 1$. A
policy $\psi = (\psi_1, \psi_2, \ldots)$ is a sequence of functions such that $\psi_t : H_{t-1} \to \mathcal{U}_r$ selects an arm in period
$t$ based on the history until the end of period $t - 1$. For any policy $\psi$ and $\mathbf{z} \in \mathbb{R}^r$, the $T$-period
cumulative regret under $\psi$ given $\mathbf{Z} = \mathbf{z}$, denoted by $\mathrm{Regret}(\mathbf{z}, T, \psi)$, is defined by

$$\mathrm{Regret}(\mathbf{z}, T, \psi) = \sum_{t=1}^{T} E\left[ \max_{\mathbf{v} \in \mathcal{U}_r} \mathbf{v}'\mathbf{z} - \mathbf{U}_t'\mathbf{z} \;\Big|\; \mathbf{Z} = \mathbf{z} \right],$$
where for any $t \ge 1$, $\mathbf{U}_t \in \mathcal{U}_r$ is the arm chosen under $\psi$ in period $t$. Since $\mathcal{U}_r$ is compact,
$\max_{\mathbf{v} \in \mathcal{U}_r} \mathbf{v}'\mathbf{z}$ is well defined for all $\mathbf{z}$. The $T$-period cumulative Bayes risk under $\psi$ is defined by

$$\mathrm{Risk}(T, \psi) = E\left[\mathrm{Regret}(\mathbf{Z}, T, \psi)\right],$$

where the expectation is taken with respect to the prior distribution of $\mathbf{Z}$. We aim to develop a
policy that minimizes the cumulative regret and Bayes risk. We note that minimizing the $T$-period
cumulative Bayes risk is equivalent to maximizing the expected total reward over $T$ periods.

To facilitate exposition, when we discuss a particular policy, we will drop the superscript and
write $X_t$ and $W_t$ to denote $X_t^{\mathbf{U}_t}$ and $W_t^{\mathbf{U}_t}$, respectively, where $\mathbf{U}_t$ is the arm chosen by the policy
in period $t$. With this convention, the reward obtained in period $t$ under a particular policy is
simply $X_t = \mathbf{U}_t'\mathbf{Z} + W_t$.
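
To make the definitions concrete, the following Python sketch simulates the reward model and evaluates the regret of a policy that always plays one fixed arm on the unit sphere. It is an illustration only: the function names are hypothetical, the noise is assumed to be standard normal, and the prior N(0, I_r/r) is borrowed from Theorem 2.1 for convenience.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def sample_prior(r, rng):
    """Draw Z ~ N(0, I_r / r) (assumed prior, as in Theorem 2.1)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)]

def play(u, z, rng):
    """One observed reward X_t = u'Z + W_t with W_t ~ N(0, 1) (assumed noise)."""
    return dot(u, z) + rng.gauss(0.0, 1.0)

def regret_of_fixed_arm(u, z, T):
    """T-period regret of always playing u when the arms are the unit sphere.

    On the unit sphere max_v v'z = ||z||, so each period costs ||z|| - u'z.
    """
    return T * (norm(z) - dot(u, z))

def bayes_risk_of_fixed_arm(u, r, T, samples, rng):
    """Monte Carlo estimate of Risk(T, psi) = E[Regret(Z, T, psi)] for the fixed-arm policy."""
    total = 0.0
    for _ in range(samples):
        total += regret_of_fixed_arm(u, sample_prior(r, rng), T)
    return total / samples
```

A fixed arm incurs regret that grows linearly in T unless it happens to point along z, which is why any useful policy has to learn the direction of Z from the observed rewards.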

1.2 Potential Applications 

Although our paper focuses on a theoretical analysis, we mention briefly potential applications
to problems in marketing and revenue management. Suppose we have $m$ arms indexed by $\mathcal{U}_r =
\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\} \subset \mathbb{R}^r$. For $k = 1, \ldots, r$, let $\boldsymbol{\phi}_k = (u_{1,k}, u_{2,k}, \ldots, u_{m,k}) \in \mathbb{R}^m$ denote an
$m$-dimensional column vector consisting of the $k$th coordinates of the different vectors $\mathbf{u}_\ell$. Let
$\boldsymbol{\mu} = (\mu_1, \ldots, \mu_m)$ be the column vector consisting of expected rewards, where $\mu_\ell$ denotes the
expected reward of arm $\mathbf{u}_\ell$. Under our formulation, the vector $\boldsymbol{\mu}$ lies in an $r$-dimensional subspace
spanned by the vectors $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_r$, that is, $\boldsymbol{\mu} = \sum_{k=1}^{r} Z_k \boldsymbol{\phi}_k$, where $\mathbf{Z} = (Z_1, \ldots, Z_r)$. If each arm
corresponds to a product to be offered to a customer, we can then interpret the vector $\boldsymbol{\phi}_k$ as a
feature vector or basis function, representing a particular characteristic of the products such as price
or popularity. We can then interpret the random variables $Z_1, \ldots, Z_r$ as regression coefficients,
obtained from approximating the vector of expected rewards using the basis functions $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_r$,
or more intuitively, as the weights associated with the different characteristics. Given a prior on the
coefficients $Z_k$, our goal is to choose a sequence of products that gives the maximum expected total
reward. This representation suggests that our model might be applicable to problems where we
can approximate high-dimensional vectors using a linear combination of a few basis functions, an
approach that has been successfully applied to high-dimensional dynamic programming problems
(see Bertsekas and Tsitsiklis (1996) for an overview).
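
As a small illustration of this representation, the Python sketch below builds the basis vectors from a list of arms and checks that the vector of expected rewards equals the corresponding linear combination. The three arms and the coefficient values are hypothetical numbers chosen only for the example.

```python
def expected_rewards(arms, z):
    """Return mu, where mu[l] = u_l' z is the expected reward of arm u_l."""
    return [sum(u_k * z_k for u_k, z_k in zip(u, z)) for u in arms]

def basis_vectors(arms):
    """phi_k collects the k-th coordinate of every arm."""
    r = len(arms[0])
    return [[u[k] for u in arms] for k in range(r)]

# Hypothetical products described by two features (say, price and popularity).
arms = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.7]]
z = [2.0, -1.0]  # hypothetical realization of the coefficients Z_1, Z_2

mu = expected_rewards(arms, z)
phi = basis_vectors(arms)
# The same vector written as the linear combination sum_k Z_k * phi_k.
mu_from_basis = [sum(z[k] * phi[k][l] for k in range(len(z))) for l in range(len(arms))]
```

Both computations give the same vector, so learning the r coefficients is enough to rank all m products.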



1.3 Related Literature 

The multiarmed bandit literature can be divided into two streams, depending on the objective
function criteria: maximizing the total discounted reward over an infinite horizon or minimizing
the cumulative regret and Bayes risk over a finite horizon. Our paper focuses exclusively on the
second criterion. Much of the work in this area focuses on understanding the rate with which the
regret and risk under various policies increase over time. In their pioneering work, Lai and Robbins
(1985) establish an asymptotic lower bound of $\Omega(m \log T)$ for the $T$-period cumulative regret for
bandit problems with $m$ independent arms whose mean rewards are "well-separated," where the
difference between the expected reward of the best and second best arms is fixed and bounded
away from zero. They further demonstrate a policy whose regret asymptotically matches the lower
bound. In contrast, our paper focuses on problems where the number of arms is large (possibly
infinite), and where the gap between the maximum expected reward and the expected reward of
the second best arm can be arbitrarily small. Lai (1987) extends these results to a Bayesian setting,
with a prior distribution on the reward characteristics of each arm. He shows that when we have $m$
arms, the $T$-period cumulative Bayes risk is of order $O(m \log^2 T)$, when the prior distribution has a
continuous density function satisfying certain properties (see Theorem 3 in Lai, 1987). Subsequent
papers along this line include Agrawal et al. (1989), Agrawal (1995), and Auer et al.



There has been relatively little research, however, on policies that exploit the dependence
among the arms. Thompson (1933) allows for correlation among arms in his initial formulation,
though he only analyzes a special case involving independent arms. Robbins (1952) formulates
a continuum-armed bandit regression problem, but does not provide an analysis of the regret or
risk. Berry and Fristedt (1985) allow for dependence among arms in their formulation in Chapter
2, but mostly focus on the case of independent arms. Feldman (1962) and Keener (1985) consider
two-armed bandit problems with two hidden states, where the rewards of each arm depend on the
underlying state that prevails. Pressman and Sonin (1990) formulate a general multiarmed bandit
problem with an arbitrary number of hidden states, and provide a detailed analysis for the case of
two hidden states. Pandey et al. (2007) study bandit problems where the dependence of the arm
rewards is represented by a hierarchical model.

A somewhat related literature on bandits with dependent arms is the recent work by Wang et al.
(2005a,b) and Goldenshluger and Zeevi (2008, 2009), who consider bandit problems with two arms,
where the expected reward of each arm depends on an exogenous variable that represents side
information. These models, however, differ from ours because they assume that the side information
variables are independent and identically distributed over time, and moreover, these variables are
perfectly observed before we choose which arm to play. In contrast, we assume that the underlying
random vector $\mathbf{Z}$ is unknown and fixed over time, to be estimated based on past rewards and
decisions.

Our formulation can be viewed as a sequential method for maximizing a linear function based
on noisy observations of the function values, and it is thus closely related to the field of stochastic
approximation, which was developed by Robbins and Monro (1951) and Kiefer and Wolfowitz
(1952). We do not provide a comprehensive review of the literature here; interested readers are
referred to an excellent survey by Lai (2003). In stochastic approximation, we wish to find an
adaptive sequence $\{\mathbf{U}_t \in \mathbb{R}^r : t \ge 1\}$ that converges to a maximizer $\mathbf{u}^*$ of a target function, and
the focus is on establishing the rate at which the mean squared error $E\left[\|\mathbf{U}_t - \mathbf{u}^*\|^2\right]$ converges
to zero (see, for example, Blum, 1954 and Cicek et al., 2009). In contrast, our cumulative regret
and Bayes risk criteria take into account the cost associated with each observation. The different
performance measures used in our formulation lead to entirely different policies and performance
characteristics.

Our model generalizes the "response surface bandits" introduced by Ginebra and Clayton (1995),
who assume a normal prior on $\mathbf{Z}$ and provide a simple tunable heuristic, without any analysis on
the regret or risk. Abe and Long (1999), Auer (2002), and Dani et al. (2008a) all consider a
special case of our model where the random vector $\mathbf{Z}$ and the error random variables $W_t^{\mathbf{u}}$ are
bounded almost surely, and with the exception of the last paper, focus on the regret criterion.
Abe and Long (1999) demonstrate a class of bandits where the dimension $r$ is at least $\Omega(\sqrt{T})$, and
show that the $T$-period regret under an arbitrary policy must be at least $\Omega(T^{3/4})$. Auer (2002)
describes an algorithm based on least squares estimation and confidence bounds, and establishes an
upper bound on the regret, for the case of finitely many arms. Dani et al.
(2008a) show that the policy of Auer (2002) can be extended to problems having an arbitrary
compact set of arms, and also make use of a barycentric spanner. They establish an $O(r\sqrt{T}\log^{3/2} T)$
upper bound on the regret, and discuss a variation of the policy that is more computationally
tractable (at the expense of higher regret). Dani et al. (2008a) also establish an $\Omega(r\sqrt{T})$ lower
bound on the Bayes risk when the set of arms is the Cartesian product of circles.¹ However, this
leaves an $O(\log^{3/2} T)$ gap from the upper bound, leaving open the question of the exact order of
regret and risk.

1.4 Contributions and Organizations 

One of our contributions is proving that the regret and Bayes risk for a broad class of linearly
parameterized bandits is of order $\Theta(r\sqrt{T})$. In Section 2, we establish an $\Omega(r\sqrt{T})$ lower bound for
an arbitrary policy, when the set of arms is the unit sphere in $\mathbb{R}^r$. Then, in Section 3, we show that
a matching $O(r\sqrt{T})$ upper bound can be achieved through a phase-based policy that alternates
between exploration and exploitation phases. To the best of our knowledge, this is the first result
that establishes matching upper and lower bounds for a class of linearly parameterized bandits.



1 The original lower bound (Theorem 3 on page 360 of Dani et al., 2008a) was not entirely correct; a correct version
was provided later, in Dani et al. (2008b).
Table 1 summarizes our results and provides a comparison with the results in Mersereau et al.
(2009) for the case $r = 1$. In the ensuing discussion of the bounds, we focus on the main parameters,
$r$ and $T$, with more precise statements given in the theorems.

Although we obtain the same lower bound of $\Omega(r\sqrt{T})$, our example and proof techniques are
very different from Dani et al. (2008a). We consider the unit sphere, with a multivariate normal
prior on $\mathbf{Z}$, and standard normal errors. The analysis in Section 2 also illuminates the behavior of
the least mean squares estimator in this setting, and we believe that it provides an approach that
can be used to address more general classes of linear estimation and adaptive control problems.

We also prove that the phase-based policy remains effective (that is, admits an $O(r\sqrt{T})$ upper
bound) for a broad class of bandit problems in which the set of arms is strongly convex² (defined
in Section 3). To our knowledge, this is the first result that establishes the connection between
a geometrical property (strong convexity) of the underlying set of arms and the effectiveness of
separating exploration from exploitation. We suspect that strong convexity may have similar
implications for other types of bandit and learning problems.

When the set of arms is an arbitrary compact set, the separation of exploration and exploitation
may not be effective, and we consider in Section 4 an active exploration policy based on least squares
estimation and confidence regions. We prove that the regret and risk under this policy are bounded
above by $O(r\sqrt{T}\log^{3/2} T)$, which is within a logarithmic factor of the lower bound. Our policy is
closely related to the one considered in Auer (2002) and further analyzed in Dani et al. (2008a),
with differences in a number of respects. First, our model allows the random vector $\mathbf{Z}$ and the
errors $W_t^{\mathbf{u}}$ to have unbounded support, which requires a somewhat more complicated analysis.
Second, our policy is an "anytime" policy, in the sense that the policy does not depend on the
time horizon $T$ of interest. In contrast, the policies of Auer (2002) and Dani et al. (2008a) involve
a certain parameter $\delta$ whose value must be set in advance as a function of the time horizon $T$ in
order to obtain the $O(r\sqrt{T}\log^{3/2} T)$ regret bound.

We finally comment on the case where the set of arms is finite and fixed. We show that the
regret and risk under our active exploration policy increase gracefully with time, as $\log T$ and $\log^2 T$,
respectively. These results show that our policy is within a constant factor of the asymptotic lower
bounds established by Lai and Robbins (1985) and Lai (1987). In contrast, for the policies of Auer
(2002) and Dani et al. (2008a), the available regret upper bounds grow over time as $\sqrt{T}\log^{3/2} T$
and $\log^3 T$, respectively.



2 One can show that the Cartesian product of circles is not strongly convex, and thus, our phase-based policy cannot
be applied to give the matching upper bound for the example used in Dani et al. (2008a).
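Dimension (r)           | Set of Arms (U_r)                     | Regret: Lower Bound  | Regret: Upper Bound          | Bayes Risk: Lower Bound | Bayes Risk: Upper Bound
------------------------|---------------------------------------|----------------------|------------------------------|-------------------------|-----------------------------
r = 1                   | Any Compact Set                       | $\Omega(\sqrt{T})$   | $O(\sqrt{T})$                | $\Omega(\log T)$        | $O(\log T)$
(Mersereau et al., 2009)|                                       |                      |                              |                         |
r ≥ 2                   | Unit Sphere (Sections 2 and 3)        | $\Omega(r\sqrt{T})$  | $O(r\sqrt{T})$               | $\Omega(r\sqrt{T})$     | $O(r\sqrt{T})$
(This Paper)            | Any Compact Set (Section 4)           | $\Omega(r\sqrt{T})$  | $O(r\sqrt{T}\log^{3/2} T)$   | $\Omega(r\sqrt{T})$     | $O(r\sqrt{T}\log^{3/2} T)$

Table 1: Regret and risk bounds for various values of r and for different collections of arms.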

We note that the bounds on the cumulative Bayes risk given in Table 1 hold under certain
assumptions on the prior distribution of the random vector $\mathbf{Z}$. For $r = 1$, $Z$ is assumed to be
a continuous random variable with a bounded density function (Theorem 3.2 in Mersereau et al.,
2009). When the collection of arms is a unit sphere with $r \ge 2$, we require that both $E[\|\mathbf{Z}\|]$ and
$E[1/\|\mathbf{Z}\|]$ are bounded (see Theorems 2.1 and 3.1 and Lemma 3.2). For general compact sets of
arms where our risk bound is not tight, we only require that $\|\mathbf{Z}\|$ has a bounded expectation.

2. Lower Bounds 

In this section, we establish $\Omega(r\sqrt{T})$ lower bounds on the regret and risk under an arbitrary policy
when the set of arms is the unit sphere. This result is stated in the following theorem.³

Theorem 2.1 (Lower Bounds). Consider a bandit problem where the set of arms is the unit sphere
in $\mathbb{R}^r$, and $W_t^{\mathbf{u}}$ has a standard normal distribution with mean zero and variance one for all $t$ and
$\mathbf{u}$. If $\mathbf{Z}$ has a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\mathbf{I}_r/r$, then for
all policies $\psi$ and every $T \ge r^2$,

$$\mathrm{Risk}(T, \psi) \ge 0.006\, r\sqrt{T}.$$

Consequently, for any policy $\psi$ and $T \ge r^2$, there exists $\mathbf{z} \in \mathbb{R}^r$ such that

$$\mathrm{Regret}(\mathbf{z}, T, \psi) \ge 0.006\, r\sqrt{T}.$$

3 The result of Theorem 2.1 easily extends to the case where the covariance matrix is $\mathbf{I}_r$, rather than $\mathbf{I}_r/r$. The proof
is essentially the same.
It suffices to establish the lower bound on the Bayes risk because the regret bound follows
immediately. Throughout this section, we assume that $\mathcal{U}_r = \{\mathbf{u} \in \mathbb{R}^r : \|\mathbf{u}\| = 1\}$. We fix an arbitrary
policy $\psi$ and for any $t \ge 1$, we let $H_t = (\mathbf{U}_1, X_1, \mathbf{U}_2, X_2, \ldots, \mathbf{U}_t, X_t)$ be the history up to time $t$.
We also let $\widehat{\mathbf{Z}}_t$ denote the least mean squares estimator of $\mathbf{Z}$ given the history $H_t$, that is,

$$\widehat{\mathbf{Z}}_t = E\left[\mathbf{Z} \mid H_t\right].$$

Let $\mathbf{S}_t^1, \ldots, \mathbf{S}_t^{r-1}$ denote a collection of orthogonal unit vectors that are also orthogonal to $\widehat{\mathbf{Z}}_t$. Note
that $\widehat{\mathbf{Z}}_t$ and $\mathbf{S}_t^1, \ldots, \mathbf{S}_t^{r-1}$ are functions of $H_t$.

Since $\mathcal{U}_r$ is the unit sphere, $\max_{\mathbf{u} \in \mathcal{U}_r} \mathbf{u}'\mathbf{z} = (\mathbf{z}'\mathbf{z})/\|\mathbf{z}\| = \|\mathbf{z}\|$, for all $\mathbf{z} \in \mathbb{R}^r$. Thus, the risk
in period $t$ is given by $E[\|\mathbf{Z}\| - \mathbf{U}_t'\mathbf{Z}]$. The following lemma establishes a lower bound on the
cumulative risk in terms of the estimator error variance and the total amount of exploration along
the directions $\mathbf{S}_T^1, \ldots, \mathbf{S}_T^{r-1}$.



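Lemma 2.2 (Risk Decomposition). For any $T \ge 1$,

$$\mathrm{Risk}(T, \psi) \ge \frac{1}{2} \sum_{k=1}^{r-1} E\left[ \|\mathbf{Z}\| \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2 + \frac{T}{\|\mathbf{Z}\|} \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \right].$$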

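Proof. Using the fact that for any two unit vectors $\mathbf{w}$ and $\mathbf{v}$, $1 - \mathbf{w}'\mathbf{v} = \|\mathbf{w} - \mathbf{v}\|^2/2$, the
instantaneous regret in period $t$ satisfies

$$\|\mathbf{Z}\| - \mathbf{U}_t'\mathbf{Z} = \|\mathbf{Z}\| \left( 1 - \mathbf{U}_t'\frac{\mathbf{Z}}{\|\mathbf{Z}\|} \right) = \frac{\|\mathbf{Z}\|}{2} \left\| \mathbf{U}_t - \frac{\mathbf{Z}}{\|\mathbf{Z}\|} \right\|^2 \ge \frac{\|\mathbf{Z}\|}{2} \sum_{k=1}^{r-1} \left\{ \left( \mathbf{U}_t - \frac{\mathbf{Z}}{\|\mathbf{Z}\|} \right)'\mathbf{S}_T^k \right\}^2,$$

where the inequality follows from the fact that $\mathbf{S}_T^1, \ldots, \mathbf{S}_T^{r-1}$ are orthogonal unit vectors. Therefore,
the cumulative conditional risk satisfies

$$2 \sum_{t=1}^{T} E\left[ \|\mathbf{Z}\| - \mathbf{U}_t'\mathbf{Z} \mid H_T \right] \ge \sum_{t=1}^{T} \sum_{k=1}^{r-1} E\left[ \|\mathbf{Z}\| \left\{ \left( \mathbf{U}_t - \frac{\mathbf{Z}}{\|\mathbf{Z}\|} \right)'\mathbf{S}_T^k \right\}^2 \;\Big|\; H_T \right]$$

$$= \sum_{t=1}^{T} \sum_{k=1}^{r-1} E\left[ \|\mathbf{Z}\| \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2 - 2 \left(\mathbf{U}_t'\mathbf{S}_T^k\right)\left(\mathbf{Z}'\mathbf{S}_T^k\right) + \frac{\left(\mathbf{Z}'\mathbf{S}_T^k\right)^2}{\|\mathbf{Z}\|} \;\Big|\; H_T \right],$$

with probability one. From the definition of $\mathbf{S}_T^k$, we have $\widehat{\mathbf{Z}}_T'\mathbf{S}_T^k = 0$ for $k = 1, \ldots, r - 1$. Therefore,
for $t \le T$,

$$E\left[ \left(\mathbf{U}_t'\mathbf{S}_T^k\right)\left(\mathbf{Z}'\mathbf{S}_T^k\right) \mid H_T \right] = \left(\mathbf{U}_t'\mathbf{S}_T^k\right) E\left[\mathbf{Z}' \mid H_T\right] \mathbf{S}_T^k = \left(\mathbf{U}_t'\mathbf{S}_T^k\right) \widehat{\mathbf{Z}}_T'\mathbf{S}_T^k = 0,$$

which eliminates the middle term in the summand above. Furthermore, we see that $\mathbf{Z}'\mathbf{S}_T^k =
\left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k$ for all $k$. Thus, with probability one,

$$2 \sum_{t=1}^{T} E\left[ \|\mathbf{Z}\| - \mathbf{U}_t'\mathbf{Z} \mid H_T \right] \ge \sum_{k=1}^{r-1} E\left[ \|\mathbf{Z}\| \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2 + \frac{T}{\|\mathbf{Z}\|} \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \;\Big|\; H_T \right],$$

and the desired result follows by taking the expectation of both sides. □

Since $\mathbf{S}_T^k$ is orthogonal to $\widehat{\mathbf{Z}}_T$, we can interpret $\sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2$ and $\left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2$ as the
total amount of exploration over $T$ periods and the squared estimation error, respectively, in the
direction $\mathbf{S}_T^k$. Thus, Lemma 2.2 tells us that the cumulative risk is bounded below by the sum
of the squared estimation error and the total amount of exploration in the past $T$ periods. This
result suggests an approach for establishing a lower bound on the risk. If the amount of exploration
$\sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2$ is large, then the risk will be large. On the other hand, if the amount of exploration
is small, we expect significant estimation errors, which in turn imply large risk. This intuition
is made precise in Lemma 2.3, which relates the squared estimation error and the amount of
exploration.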

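Lemma 2.3 (Little Exploration Implies Large Estimation Errors). For any $k$ and $T \ge 1$,

$$E\left[ \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \;\Big|\; H_T \right] \ge \frac{1}{r + \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2},$$

with probability one.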

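Proof. Let $\mathbf{Q}_T = \widehat{\mathbf{Z}}_T / \|\widehat{\mathbf{Z}}_T\|$. For any $t$, we have that $\mathbf{U}_t = \sum_{k=1}^{r-1} \left(\mathbf{U}_t'\mathbf{S}_T^k\right) \mathbf{S}_T^k + \left(\mathbf{U}_t'\mathbf{Q}_T\right) \mathbf{Q}_T$. Let
$\mathbf{V} = \left( \mathbf{S}_T^1 \;\cdots\; \mathbf{S}_T^{r-1} \;\; \mathbf{Q}_T \right)$ be an $r \times r$ orthonormal matrix whose columns are the vectors
$\mathbf{S}_T^1, \ldots, \mathbf{S}_T^{r-1}$, and $\mathbf{Q}_T$, respectively. Then, it is easy to verify that

$$\sum_{t=1}^{T} \mathbf{U}_t\mathbf{U}_t' = \mathbf{V}\mathbf{A}\mathbf{V}', \qquad \text{where } \mathbf{A} = \begin{pmatrix} \boldsymbol{\Sigma} & \mathbf{c} \\ \mathbf{c}' & a \end{pmatrix}$$

is an $r \times r$ matrix, with $a = \mathbf{Q}_T' \left( \sum_{t=1}^{T} \mathbf{U}_t\mathbf{U}_t' \right) \mathbf{Q}_T$, $\mathbf{c}$ is an $(r-1)$-dimensional column vector,
and where $\boldsymbol{\Sigma}$ is an $(r-1) \times (r-1)$ matrix with $\Sigma_{k\ell} = \left(\mathbf{S}_T^k\right)' \left( \sum_{t=1}^{T} \mathbf{U}_t\mathbf{U}_t' \right) \mathbf{S}_T^\ell =
\sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)\left(\mathbf{U}_t'\mathbf{S}_T^\ell\right)$ for $k, \ell = 1, \ldots, r - 1$.

Since $\mathbf{Z}$ has a multivariate normal prior distribution with covariance matrix $\mathbf{I}_r/r$, it is a standard
result (use, for example, Corollary E.3.5 in Appendix E in Bertsekas, 1995) that

$$E\left[ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)\left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)' \;\Big|\; H_T \right] = \left( r\mathbf{I}_r + \sum_{t=1}^{T} \mathbf{U}_t\mathbf{U}_t' \right)^{-1} = \mathbf{V} \left( r\mathbf{I}_r + \mathbf{A} \right)^{-1} \mathbf{V}'.$$

Since $\mathbf{S}_T^k$ is a function of $H_T$ and $\mathbf{V}'\mathbf{S}_T^k = \mathbf{e}_k$, we have, for $k \le r - 1$, that

$$E\left[ \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \;\Big|\; H_T \right] = \left(\mathbf{V}'\mathbf{S}_T^k\right)' \left( r\mathbf{I}_r + \mathbf{A} \right)^{-1} \left(\mathbf{V}'\mathbf{S}_T^k\right) = \left[ \left( r\mathbf{I}_r + \mathbf{A} \right)^{-1} \right]_{kk} \ge \frac{1}{\left( r\mathbf{I}_r + \mathbf{A} \right)_{kk}} = \frac{1}{r + \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2},$$

where the inequality follows from Fiedler's Inequality (see, for example, Theorem 2.1 in Fiedler and Pták,
1997), and the final equality follows from the definition of $\mathbf{A}$. □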
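
The inequality at the heart of this proof can be checked numerically. The Python sketch below forms the posterior covariance $(r\mathbf{I}_r + \sum_t \mathbf{U}_t\mathbf{U}_t')^{-1}$ for randomly drawn unit arms and compares each diagonal entry with the reciprocal of the corresponding diagonal entry of the matrix being inverted. It is an illustrative check under simplifying assumptions: the arms are drawn at random rather than by a policy, and the coordinate directions stand in for the directions $\mathbf{S}_T^k$. All function names are hypothetical.

```python
import math
import random

def outer_sum(vectors, r):
    """Return sum_t u_t u_t' as an r x r nested list."""
    m = [[0.0] * r for _ in range(r)]
    for u in vectors:
        for i in range(r):
            for j in range(r):
                m[i][j] += u[i] * u[j]
    return m

def invert(m):
    """Invert a small matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for i in range(n):
            if i != col:
                f = a[i][col]
                a[i] = [x - f * y for x, y in zip(a[i], a[col])]
    return [row[n:] for row in a]

def random_unit_vector(r, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(r)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def posterior_covariance(arms_played, r):
    """(r I_r + sum_t u_t u_t')^{-1}, assuming prior N(0, I_r / r) and unit-variance noise."""
    m = outer_sum(arms_played, r)
    for i in range(r):
        m[i][i] += r
    return invert(m)

rng = random.Random(1)
r, T = 4, 50
played = [random_unit_vector(r, rng) for _ in range(T)]
cov = posterior_covariance(played, r)
explore = outer_sum(played, r)
# Error variance along e_k minus the lower bound 1 / (r + exploration along e_k).
gaps = [cov[k][k] - 1.0 / (r + explore[k][k]) for k in range(r)]
```

Every entry of `gaps` should be nonnegative up to rounding, which is the statement that little exploration in a direction forces a large estimation error in that direction.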



The next lemma gives a lower bound on the probability that $\mathbf{Z}$ is bounded away from the origin.
The proof follows from simple calculations involving normal densities, and the details are given in
Appendix A.1.

Lemma 2.4. For any $\theta \le 1/2$ and $\beta > 0$, $\Pr\{\theta \le \|\mathbf{Z}\| \le \beta\} \ge 1 - 4\theta^2 - \frac{1}{\beta^2}$.
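
A quick Monte Carlo experiment gives a feel for how conservative this bound is. The Python sketch below reads the lemma's bound as $1 - 4\theta^2 - 1/\beta^2$ and compares it with the empirical frequency of the event for $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I}_r/r)$. The dimension, sample size, and function names are assumptions made for the illustration.

```python
import math
import random

def norm_in_range_frequency(r, theta, beta, samples, rng):
    """Empirical Pr{theta <= ||Z|| <= beta} for Z ~ N(0, I_r / r)."""
    sd = 1.0 / math.sqrt(r)
    hits = 0
    for _ in range(samples):
        norm = math.sqrt(sum(rng.gauss(0.0, sd) ** 2 for _ in range(r)))
        if theta <= norm <= beta:
            hits += 1
    return hits / samples

def lemma_bound(theta, beta):
    """Lower bound 1 - 4 theta^2 - 1 / beta^2, as read from Lemma 2.4."""
    return 1.0 - 4.0 * theta ** 2 - 1.0 / beta ** 2

rng = random.Random(2)
freq = norm_in_range_frequency(r=2, theta=0.09, beta=3.0, samples=20000, rng=rng)
bound = lemma_bound(0.09, 3.0)
```

With these values the empirical frequency sits well above the bound, which is all the subsequent argument requires.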

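The last lemma establishes a lower bound on the sum of the total amount of exploration and
the squared estimation error, which is also the minimum cumulative Bayes risk along the direction
$\mathbf{S}_T^k$ by Lemma 2.2.

Lemma 2.5 (Minimum Directional Risk). For $k = 1, \ldots, r - 1$, and $T \ge r^2$,

$$E\left[ \|\mathbf{Z}\| \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2 + \frac{T}{\|\mathbf{Z}\|} \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \right] \ge 0.027\sqrt{T}.$$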



We note that if $\|\mathbf{Z}\|$ were a constant, rather than a random variable, Lemma 2.5 would follow
immediately. Hence, most of the work in the proof below involves constraining $\|\mathbf{Z}\|$ to a certain
range $[\theta, \beta]$.

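Proof. Consider an arbitrary $k$, and let $\Xi = \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2$ and $\Gamma = \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2$. Our proof
will make use of positive constants $\theta$, $\beta$, and $\eta$, whose values will be chosen later. Note that

$$E\left[ \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \;\Big|\; H_T \right] \ge E\left[ \left( \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \right) 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Xi \ge \sqrt{T}\} \;\Big|\; H_T \right]$$

$$\qquad + E\left[ \left( \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \right) 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Xi < \sqrt{T}\} \;\Big|\; H_T \right]$$

$$\ge \theta\sqrt{T}\, 1\{\Xi \ge \sqrt{T}\}\, E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\} \mid H_T \right] + \frac{T}{\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ \Gamma\, 1\{\theta \le \|\mathbf{Z}\| \le \beta\} \mid H_T \right],$$

where we use the fact that $\Xi$ is a function of $H_T$ in the final inequality. We will now lower bound
the last term on the right hand side of the above inequality. Let $\Theta = E[\Gamma \mid H_T]$. Since $\Theta$ is a
function of $H_T$,

$$\frac{T}{\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ \Gamma\, 1\{\theta \le \|\mathbf{Z}\| \le \beta\} \mid H_T \right] \ge \frac{T}{\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ \Gamma\, 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right]$$

$$\ge \frac{\eta T \Theta}{\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right]$$

$$\ge \frac{\eta\sqrt{T}}{2\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right],$$

where the last inequality follows from Lemma 2.3, which implies that, with probability one,

$$\frac{\eta T \Theta}{\beta}\, 1\{\Xi < \sqrt{T}\} \ge \frac{\eta T}{\beta} \cdot \frac{1}{r + \Xi}\, 1\{\Xi < \sqrt{T}\} \ge \frac{\eta T}{\beta} \cdot \frac{1}{r + \sqrt{T}}\, 1\{\Xi < \sqrt{T}\} \ge \frac{\eta\sqrt{T}}{2\beta}\, 1\{\Xi < \sqrt{T}\},$$

and where the last inequality follows from the fact that $T \ge r^2$, and thus, $1/(r + \sqrt{T}) \ge 1/(2\sqrt{T})$.
Putting everything together, we obtain

$$E\left[ \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \;\Big|\; H_T \right] \ge \theta\sqrt{T}\, 1\{\Xi \ge \sqrt{T}\}\, E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\} \mid H_T \right]$$

$$\qquad + \frac{\eta\sqrt{T}}{2\beta}\, 1\{\Xi < \sqrt{T}\}\, E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right]$$

$$\ge \min\left\{ \theta, \frac{\eta}{2\beta} \right\} \sqrt{T}\; E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right],$$

with probability one. By the Bonferroni Inequality, we have that

$$E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right] = \Pr\{\theta \le \|\mathbf{Z}\| \le \beta \text{ and } \Gamma \ge \eta\Theta \mid H_T\}$$

$$\ge \Pr\{\theta \le \|\mathbf{Z}\| \le \beta \mid H_T\} + \Pr\{\Gamma \ge \eta\Theta \mid H_T\} - 1,$$

with probability one. Conditioned on $H_T$, $\left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k$ is normally distributed with mean zero
and variance

$$E\left[ \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \;\Big|\; H_T \right] = \Theta.$$

Let $\Phi(\cdot)$ be the cumulative distribution function of the standard normal random variable. Then,

$$\Pr\{\Gamma \ge \eta\Theta \mid H_T\} = \Pr\left\{ \left| \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right| \ge \sqrt{\eta\Theta} \;\Big|\; H_T \right\} = 2\left( 1 - \Phi\left(\sqrt{\eta}\right) \right),$$

from which it follows that, with probability one,

$$E\left[ 1\{\theta \le \|\mathbf{Z}\| \le \beta\}\, 1\{\Gamma \ge \eta\Theta\} \mid H_T \right] \ge \Pr\{\theta \le \|\mathbf{Z}\| \le \beta \mid H_T\} + 2\left( 1 - \Phi\left(\sqrt{\eta}\right) \right) - 1.$$

Therefore,

$$E\left[ \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \;\Big|\; H_T \right] \ge \min\left\{ \theta, \frac{\eta}{2\beta} \right\} \left[ \Pr\{\theta \le \|\mathbf{Z}\| \le \beta \mid H_T\} + 2\left( 1 - \Phi\left(\sqrt{\eta}\right) \right) - 1 \right] \sqrt{T},$$

with probability one, which implies that

$$E\left[ \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \right] \ge \min\left\{ \theta, \frac{\eta}{2\beta} \right\} \left[ \Pr\{\theta \le \|\mathbf{Z}\| \le \beta\} + 2\left( 1 - \Phi\left(\sqrt{\eta}\right) \right) - 1 \right] \sqrt{T}$$

$$\ge \min\left\{ \theta, \frac{\eta}{2\beta} \right\} \left[ 2\left( 1 - \Phi\left(\sqrt{\eta}\right) \right) - 4\theta^2 - \frac{1}{\beta^2} \right] \sqrt{T},$$

where the last inequality follows from Lemma 2.4. Set $\theta = 0.09$, $\beta = 3$, and $\eta = 0.5$, to obtain
$E\left[ \|\mathbf{Z}\|\,\Xi + \frac{T}{\|\mathbf{Z}\|}\,\Gamma \right] \ge 0.027\sqrt{T}$, which is the desired result. □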
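
The numerical constant can be reproduced directly. The Python sketch below evaluates the expression min{θ, η/(2β)} · [2(1 − Φ(√η)) − 4θ² − 1/β²] at θ = 0.09, β = 3, and η = 0.5, reading the bound of Lemma 2.4 as 1 − 4θ² − 1/β². The function names are hypothetical, and Φ is computed from the error function in the standard library.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def directional_constant(theta, beta, eta):
    """min{theta, eta/(2 beta)} * [2 (1 - Phi(sqrt(eta))) - 4 theta^2 - 1/beta^2]."""
    tail = 2.0 * (1.0 - std_normal_cdf(math.sqrt(eta)))
    return min(theta, eta / (2.0 * beta)) * (tail - 4.0 * theta ** 2 - 1.0 / beta ** 2)

c = directional_constant(theta=0.09, beta=3.0, eta=0.5)
# Constant multiplying r * sqrt(T) once r - 1 >= r / 2 is applied to (r - 1) / 2 * 0.027.
theorem_constant = 0.027 / 4.0
```

The value of `c` comes out at about 0.028, so rounding down to 0.027 is safe, and 0.027/4 = 0.00675 is in turn at least the 0.006 stated in Theorem 2.1.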
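Finally, here is the proof of Theorem 2.1.

Proof. It follows from Lemmas 2.2 and 2.5 that

$$\mathrm{Risk}(T, \psi) \ge \frac{1}{2} \sum_{k=1}^{r-1} E\left[ \|\mathbf{Z}\| \sum_{t=1}^{T} \left(\mathbf{U}_t'\mathbf{S}_T^k\right)^2 + \frac{T}{\|\mathbf{Z}\|} \left\{ \left(\mathbf{Z} - \widehat{\mathbf{Z}}_T\right)'\mathbf{S}_T^k \right\}^2 \right] \ge \frac{r-1}{2} \cdot 0.027\sqrt{T} \ge \frac{r}{4} \cdot 0.027\sqrt{T} \ge 0.006\, r\sqrt{T},$$

where we have used the fact $r \ge 2$, which implies that $r - 1 \ge r/2$. □

3. Matching Upper Bounds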



We have established $\Omega(r\sqrt{T})$ lower bounds when the set of arms $\mathcal{U}_r$ is the unit sphere. We now
prove that a policy that alternates between exploration and exploitation phases yields matching
upper bounds on the regret and risk, and is therefore optimal for this problem. Surprisingly, we
will see that the phase-based policy is effective for a large class of bandit problems, involving a
strongly convex set of arms. We introduce the following assumption on the tails of the error random
variables and on the set of arms $\mathcal{U}_r$, which will remain in effect throughout the rest of the paper.

Assumption 1.

(a) There exists a positive constant $\sigma_0$ such that for any $r \ge 2$, $\mathbf{u} \in \mathcal{U}_r$, $t \ge 1$, and $x \in \mathbb{R}$, we
have $E\left[ e^{x W_t^{\mathbf{u}}} \right] \le e^{x^2 \sigma_0^2 / 2}$.

(b) There exist positive constants $\bar{u}$ and $\lambda_0$ such that for any $r \ge 2$,

$$\max_{\mathbf{u} \in \mathcal{U}_r} \|\mathbf{u}\| \le \bar{u},$$

and the set of arms $\mathcal{U}_r \subset \mathbb{R}^r$ has $r$ linearly independent elements $\mathbf{b}_1, \ldots, \mathbf{b}_r$ such that
$\lambda_{\min}\left( \sum_{k=1}^{r} \mathbf{b}_k \mathbf{b}_k' \right) \ge \lambda_0$.

Under Assumption 1(a), the tails of the distribution of the errors $W_t^{\mathbf{u}}$ decay at least as fast as
for a normal random variable with variance $\sigma_0^2$. The first part of Assumption 1(b) ensures that
the expected reward of the arms remains bounded as the dimension $r$ increases, while the arms
$\mathbf{b}_1, \ldots, \mathbf{b}_r$ given in the second part of Assumption 1(b) will be used during the exploration phase
of our policy.

Our policy, which we refer to as Phased Exploration and Greedy Exploitation
(PEGE), operates in cycles, and in each cycle, we alternate between exploration and exploitation
phases. During the exploration phase of cycle $c$, we play the $r$ linearly independent arms from
Assumption 1(b). Using the rewards observed during the exploration phases in the past $c$ cycles,
we compute an ordinary least squares (OLS) estimate $\widehat{\mathbf{Z}}(c)$. In the exploitation phase of cycle $c$,
we use $\widehat{\mathbf{Z}}(c)$ as a proxy for $\mathbf{Z}$ and compute a greedy decision $\mathbf{G}(c) \in \mathcal{U}_r$ defined by:

$$\mathbf{G}(c) = \arg\max_{\mathbf{v} \in \mathcal{U}_r} \mathbf{v}'\widehat{\mathbf{Z}}(c), \qquad (2)$$

where we break ties arbitrarily. We then play the arm $\mathbf{G}(c)$ for an additional $c$ periods to complete
cycle $c$. Here is a formal description of the policy.

Phased Exploration and Greedy Exploitation (PEGE)
Description: For each cycle $c \geq 1$, complete the following two phases.

1. Exploration ($r$ periods): For $k = 1, 2, \ldots, r$, play arm $b_k \in \mathcal{U}_r$ given in Assumption 1(b),
and observe the reward $X^{b_k}(c)$. Compute the OLS estimate $\widehat{Z}(c) \in \mathbb{R}^r$, given by

$$ \widehat{Z}(c) = \left( c \sum_{k=1}^{r} b_k b_k' \right)^{-1} \sum_{s=1}^{c} \sum_{k=1}^{r} b_k X^{b_k}(s) = Z + \frac{1}{c} \left( \sum_{k=1}^{r} b_k b_k' \right)^{-1} \sum_{s=1}^{c} \sum_{k=1}^{r} b_k W^{b_k}(s) , $$

where for any $k$, $X^{b_k}(s)$ and $W^{b_k}(s)$ denote the observed reward and the error random
variable associated with playing arm $b_k$ in cycle $s$. Note that the last equality follows from
Equation (1) defining our model.

2. Exploitation ($c$ periods): Play the greedy arm $G(c) = \arg\max_{v \in \mathcal{U}_r} v' \widehat{Z}(c)$ for $c$ periods.
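
The two phases above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are ours rather than the paper's: the set of arms is the unit sphere, the exploration arms $b_1, \ldots, b_r$ are the standard basis vectors (so that $\sum_k b_k b_k'$ is the identity and the OLS estimate reduces to a per-coordinate average), and the errors are i.i.d. normal. The function name `pege_unit_sphere` and its parameters are hypothetical.

```python
import math
import random


def pege_unit_sphere(z, num_cycles, sigma=1.0, seed=0):
    """Run PEGE for num_cycles cycles; return (cumulative regret, periods played).

    Assumptions (ours, for illustration): arms are the unit sphere, the
    exploration arms are the standard basis vectors e_1..e_r, and the
    errors are i.i.d. N(0, sigma^2).
    """
    rng = random.Random(seed)
    r = len(z)
    norm_z = math.sqrt(sum(v * v for v in z))  # best expected reward on the sphere
    reward_sums = [0.0] * r  # running sum of rewards observed on arm e_k
    regret, periods = 0.0, 0
    for c in range(1, num_cycles + 1):
        # Exploration: play e_1..e_r once each; arm e_k has expected reward z[k].
        for k in range(r):
            reward_sums[k] += z[k] + rng.gauss(0.0, sigma)
            regret += norm_z - z[k]
            periods += 1
        # OLS estimate: with b_k = e_k it is the average exploration reward per coordinate.
        z_hat = [s / c for s in reward_sums]
        norm_hat = math.sqrt(sum(v * v for v in z_hat))
        # Greedy arm on the unit sphere: the normalized estimate (fixed arm if z_hat = 0).
        if norm_hat > 0:
            g = [v / norm_hat for v in z_hat]
        else:
            g = [1.0] + [0.0] * (r - 1)
        # Exploitation: play the greedy arm for c periods.
        instantaneous = norm_z - sum(a * b for a, b in zip(z, g))
        regret += c * instantaneous
        periods += c
    return regret, periods
```

After $K$ cycles the sketch has played $rK + K(K+1)/2$ periods, so the exploration cost grows linearly in $K$ while the horizon grows quadratically.
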

Since $\mathcal{U}_r$ is compact, for each $z \in \mathbb{R}^r$, there is an optimal arm that gives the maximum expected
reward. When this best arm varies smoothly with $z$, we will show that the $T$-period regret and
risk under the PEGE policy are bounded above by $O(r\sqrt{T})$. More precisely, we say that a set of
arms $\mathcal{U}_r$ satisfies the smooth best arm response with parameter $J$ (SBAR($J$), for short) condition if
for any nonzero vector $z \in \mathbb{R}^r \setminus \{0\}$, there is a unique best arm $u^*(z) \in \mathcal{U}_r$ that gives the maximum
expected reward, and for any two unit vectors $z \in \mathbb{R}^r$ and $y \in \mathbb{R}^r$ with $\|z\| = \|y\| = 1$, we have

$$ \|u^*(z) - u^*(y)\| \leq J \|z - y\| . $$

Even though the SBAR condition appears to be an implicit one, it admits a simple interpretation.
According to Corollary 4 of Polovinkin (1996), a compact set $\mathcal{U}_r$ satisfies condition SBAR($J$)
if and only if it is strongly convex with parameter $J$, in the sense that the set $\mathcal{U}_r$ can be represented
as the intersection of closed balls of radius $J$. Intuitively, the SBAR condition requires the boundary
of $\mathcal{U}_r$ to have a curvature that is bounded below by a positive constant. For some examples,
the unit ball satisfies the SBAR(1) condition. Furthermore, according to Theorem 3 of Polovinkin
(1996), an ellipsoid of the form $\{u \in \mathbb{R}^r : u' Q^{-1} u \leq 1\}$, where $Q$ is a symmetric positive definite
matrix, satisfies the condition SBAR$\left( \lambda_{\max}(Q) / \sqrt{\lambda_{\min}(Q)} \right)$.

The main result of this section is stated in the following theorem. The proof is given in
Section 3.1.

Theorem 3.1 (Regret and Risk Under the Greedy Policy). Suppose that Assumption 1 holds and
that the sets $\mathcal{U}_r$ satisfy the SBAR($J$) condition. Then, there exists a positive constant $a_1$ that
depends only on $\sigma_0$, $\bar{u}$, $\lambda_0$, and $J$, such that for any $z \in \mathbb{R}^r \setminus \{0\}$ and $T \geq r$,

$$ \mathrm{Regret}(z, T, \mathrm{PEGE}) \leq a_1 \left( \|z\| + \frac{1}{\|z\|} \right) r \sqrt{T} . $$

Suppose, in addition, that there exists a constant $M > 0$ such that for every $r \geq 2$ we have $E[\|Z\|] \leq M$
and $E[1/\|Z\|] \leq M$. Then, there exists a positive constant $a_2$ that depends only on $\sigma_0$, $\bar{u}$, $\lambda_0$,
$J$, and $M$, such that for any $T \geq r$,

$$ \mathrm{Risk}(T, \mathrm{PEGE}) \leq a_2 \, r \sqrt{T} . $$

Dependence on $\|z\|$ in the regret bound: By Assumption 1(b), for any $z \in \mathbb{R}^r$, the instantaneous
regret under arm $v \in \mathcal{U}_r$ is bounded by $\max_{u \in \mathcal{U}_r} z'(u - v) \leq 2\bar{u}\|z\|$. Thus, $2\bar{u}\|z\|T$ provides
a trivial upper bound on the $T$-period cumulative regret under the PEGE policy. Combining this
with Theorem 3.1, we have that

$$ \mathrm{Regret}(z, T, \mathrm{PEGE}) \leq \max\{a_1, 2\bar{u}\} \cdot \min\left\{ \left( \|z\| + \frac{1}{\|z\|} \right) r\sqrt{T} , \; \|z\| T \right\} . $$

The above result shows that the performance of our policy does not deteriorate as the norm of $z$
approaches zero.

Intuitively, the requirement $E[\|Z\|] \leq M$ in Theorem 3.1 implies that, as $r$ increases, the
maximum expected reward (over all arms) remains bounded. Moreover, the assumption on the
boundedness of $E[1/\|Z\|]$ means that $Z$ does not have too much mass near the origin. The
following lemma provides conditions under which this assumption holds, and shows that the case
of the multivariate normal distribution used in Theorem 2.1 is also covered. The proof is given in
Appendix A.2.

Lemma 3.2 (Small Mass Near the Origin).

(a) Suppose that there exist constants $M_0$ and $\rho \in (0, 1]$ such that for any $r \geq 2$, the random
variable $\|Z\|$ has a density function $g : \mathbb{R}_+ \to \mathbb{R}_+$ such that $g(x) \leq M_0 x^{\rho}$ for all $x \in [0, \rho]$.
Then, $E[1/\|Z\|] \leq M$, where $M$ depends only on $M_0$ and $\rho$.

(b) Suppose that for any $r \geq 2$, the random vector $Z$ has a multivariate normal distribution with
mean $0 \in \mathbb{R}^r$ and covariance matrix $I_r / r$. Then, $E[\|Z\|] \leq 1$ and $E[1/\|Z\|] \leq \sqrt{\pi}$.

The following corollary shows that the example in Section 2 admits tight matching upper bounds
on the regret and risk.

Corollary 3.3 (Matching Upper Bounds). Consider a bandit problem where the set of arms is the
unit sphere in $\mathbb{R}^r$, and where $W_t^u$ has a standard normal distribution with mean zero and variance
one for all $t$ and $u$. Then, there exists an absolute constant $a_3$ such that for any $z \in \mathbb{R}^r \setminus \{0\}$ and
$T \geq r$,

$$ \mathrm{Regret}(z, T, \mathrm{PEGE}) \leq a_3 \left( \|z\| + \frac{1}{\|z\|} \right) r \sqrt{T} . $$

Moreover, if $Z$ has a multivariate normal distribution with mean $0$ and covariance matrix $I_r / r$,
then for all $T \geq r$,

$$ \mathrm{Risk}(T, \mathrm{PEGE}) \leq a_3 \, r \sqrt{T} . $$

Proof. Since the set of arms is the unit sphere and the errors are standard normal, Assumption 1
is satisfied with $\sigma_0 = \bar{u} = \lambda_0 = 1$. Moreover, as already discussed, the unit sphere satisfies
the SBAR(1) condition. Finally, by Lemma 3.2, the random vector $Z$ satisfies the hypotheses of
Theorem 3.1. The regret and risk bounds then follow immediately. □

3.1 Proof of Theorem 3.1

The proof of Theorem 3.1 relies on the following upper bound on the square of the norm difference
between $\widehat{Z}(c)$ and $Z$.

Lemma 3.4 (Bound on Squared Norm Difference). Under Assumption 1, there exists a positive
constant $h_1$ that depends only on $\sigma_0$, $\bar{u}$, and $\lambda_0$ such that for any $z \in \mathbb{R}^r$ and $c \geq 1$,

$$ E\left[ \left\| \widehat{Z}(c) - z \right\|^2 \,\Big|\, Z = z \right] \leq \frac{h_1 r}{c} . $$

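Proof. Recall from the definition of the PEGE policy that the estimate $\widehat{Z}(c)$ at the end of the
exploration phase of cycle $c$ is given by

$$ \widehat{Z}(c) = \left( c \sum_{k=1}^{r} b_k b_k' \right)^{-1} \sum_{s=1}^{c} \sum_{k=1}^{r} b_k X^{b_k}(s) = Z + \frac{1}{c} \sum_{s=1}^{c} B \, V(s) , $$

where $B = \left( \sum_{k=1}^{r} b_k b_k' \right)^{-1}$ and $V(s) = \sum_{k=1}^{r} b_k W^{b_k}(s)$. Note that the mean-zero random
variables $W^{b_k}(s)$ are independent of each other and their variance is bounded by some constant $\gamma_0$
that depends only on $\sigma_0$. Then, it follows from Assumption 1 that

$$ E\left[ \left\| \widehat{Z}(c) - z \right\|^2 \,\Big|\, Z = z \right] = \frac{1}{c^2} \sum_{s=1}^{c} E\left[ V(s)' B^2 V(s) \right] = \frac{1}{c^2} \sum_{s=1}^{c} \sum_{k=1}^{r} E\left[ \left( W^{b_k}(s) \right)^2 \right] b_k' B^2 b_k \leq \frac{\gamma_0}{c} \sum_{k=1}^{r} b_k' B^2 b_k \leq \frac{\gamma_0 \bar{u}^2 r}{\lambda_0^2 c} , $$

where the cross terms vanish because the errors are independent with mean zero, and the last
inequality uses $\|b_k\| \leq \bar{u}$ together with $\lambda_{\max}(B) \leq 1/\lambda_0$. This is the desired result. □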



The next lemma gives an upper bound on the difference between two normalized vectors in 
terms of the difference of the original vectors. 

Lemma 3.5 (Difference Between Normalized Vectors). For any $z, w \in \mathbb{R}^r$, not both equal to zero,

$$ \left\| \frac{z}{\|z\|} - \frac{w}{\|w\|} \right\| \leq \frac{2 \|z - w\|}{\max\{\|z\|, \|w\|\}} , $$

where we define $0/\|0\|$ to be some fixed unit vector.

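Proof. The inequality is easily seen to hold if either $w = 0$ or $z = 0$. So, assume that both $w$ and
$z$ are nonzero. Using the triangle inequality and the fact that $\big| \|w\| - \|z\| \big| \leq \|w - z\|$, we have
that

$$ \left\| \frac{w}{\|w\|} - \frac{z}{\|z\|} \right\| \leq \left\| \frac{w}{\|w\|} - \frac{z}{\|w\|} \right\| + \left\| \frac{z}{\|w\|} - \frac{z}{\|z\|} \right\| = \frac{\|w - z\|}{\|w\|} + \|z\| \left| \frac{1}{\|w\|} - \frac{1}{\|z\|} \right| \leq \frac{2 \|w - z\|}{\|w\|} . $$

By symmetry, we also have

$$ \left\| \frac{w}{\|w\|} - \frac{z}{\|z\|} \right\| \leq \frac{2 \|w - z\|}{\|z\|} , $$

which gives the desired result. □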
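
As a quick numerical sanity check of Lemma 3.5 (not a substitute for the proof), the sketch below evaluates both sides of the inequality on random vectors. The helper names `normalized_gap` and `worst_violation` are ours, and sampling from a standard normal distribution is an arbitrary choice made only for illustration.

```python
import math
import random


def normalized_gap(z, w):
    """Return (lhs, rhs) of Lemma 3.5 for two nonzero vectors z and w."""
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_w = math.sqrt(sum(b * b for b in w))
    lhs = math.sqrt(sum((a / norm_z - b / norm_w) ** 2 for a, b in zip(z, w)))
    rhs = 2.0 * math.sqrt(sum((a - b) ** 2 for a, b in zip(z, w))) / max(norm_z, norm_w)
    return lhs, rhs


def worst_violation(trials, dim, seed=0):
    """Largest value of lhs - rhs over random Gaussian pairs (nonpositive if the lemma holds)."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        lhs, rhs = normalized_gap(z, w)
        worst = max(worst, lhs - rhs)
    return worst
```
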



The following lemma gives an upper bound on the expected instantaneous regret under the 
greedy decision G(c) during the exploitation phase of cycle c. 

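Lemma 3.6 (Regret Under the Greedy Decision). Suppose that Assumption 1 holds and the sets
$\mathcal{U}_r$ satisfy the SBAR($J$) condition. Then, there exists a positive constant $h_2$ that depends only on
$\sigma_0$, $\bar{u}$, $\lambda_0$, and $J$, such that for any $z \in \mathbb{R}^r$ and $c \geq 1$,

$$ E\left[ \max_{u \in \mathcal{U}_r} z' \left( u - G(c) \right) \,\Big|\, Z = z \right] \leq \frac{r h_2}{c \, \|z\|} . $$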

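Proof. The result is trivially true when $z = 0$. So, let us fix some $z \in \mathbb{R}^r \setminus \{0\}$. By comparing the
greedy decision $G(c)$ with the best arm $u^*(z)$, we see that the instantaneous regret satisfies

$$ z' \left( u^*(z) - G(c) \right) = \left( z - \widehat{Z}(c) \right)' u^*(z) + \left( u^*(z) - G(c) \right)' \widehat{Z}(c) + \left( \widehat{Z}(c) - z \right)' G(c) $$
$$ \leq \left( z - \widehat{Z}(c) \right)' u^*(z) + \left( \widehat{Z}(c) - z \right)' G(c) $$
$$ = \left( \widehat{Z}(c) - z \right)' \left( G(c) - u^*(z) \right) = \left( \widehat{Z}(c) - z \right)' \left( u^*\big(\widehat{Z}(c)\big) - u^*(z) \right) , $$

where the inequality follows from the definition of the greedy decision in Equation (2), and the
final equality follows from the fact that $G(c) = u^*\big(\widehat{Z}(c)\big)$. As a convention, we define $0/\|0\|$ to be
some fixed unit vector and set $u^*(0) = u^*(0/\|0\|)$.

It then follows from the Cauchy-Schwarz Inequality that, with probability one,

$$ z' \left( u^*(z) - G(c) \right) \leq \left\| \widehat{Z}(c) - z \right\| \left\| u^*\big(\widehat{Z}(c)\big) - u^*(z) \right\| = \left\| \widehat{Z}(c) - z \right\| \left\| u^*\!\left( \frac{\widehat{Z}(c)}{\|\widehat{Z}(c)\|} \right) - u^*\!\left( \frac{z}{\|z\|} \right) \right\| $$
$$ \leq J \left\| \widehat{Z}(c) - z \right\| \left\| \frac{\widehat{Z}(c)}{\|\widehat{Z}(c)\|} - \frac{z}{\|z\|} \right\| \leq \frac{2J \left\| \widehat{Z}(c) - z \right\|^2}{\|z\|} , $$

where the equality follows from the fact that $u^*(z) = u^*(\lambda z)$ for all $\lambda > 0$. The second inequality
follows from condition SBAR($J$), and the final inequality follows from Lemma 3.5. The desired
result follows by taking conditional expectations, given $Z = z$, and applying Lemma 3.4. □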

We can now complete the proof of Theorem 3.1 by adding the regret over the different times
and cycles. By Assumption 1 and the Cauchy-Schwarz Inequality, the instantaneous regret from
playing any arm $u \in \mathcal{U}_r$ is bounded above by $\max_{v \in \mathcal{U}_r} z'(v - u) \leq 2\bar{u}\|z\|$. Consider an arbitrary
cycle $c$. Then, the total regret incurred during the exploration phase (with $r$ periods) in this
cycle is bounded above by $2\bar{u} r \|z\|$. During the exploitation phase of cycle $c$, we always play
the greedy arm $G(c)$. The expected instantaneous regret in each period during the exploitation
phase is bounded above by $r h_2 / (c \|z\|)$. So, the total regret during cycle $c$ is bounded above by
$2\bar{u} r \|z\| + h_2 r / \|z\|$. Summing over $K$ cycles, we obtain

$$ \mathrm{Regret}\left( z, \; rK + \sum_{c=1}^{K} c, \; \mathrm{PEGE} \right) \leq h_3 \, r \|z\| K + h_4 \sum_{c=1}^{K} \frac{r}{\|z\|} = h_3 \, r \|z\| K + h_4 \frac{r K}{\|z\|} , $$

for some positive constants $h_3$ and $h_4$ that depend only on $\sigma_0$, $\bar{u}$, $\lambda_0$, and $J$.

Consider an arbitrary time period $T \geq r$ and $z \in \mathbb{R}^r$. Let $K_0 = \left\lceil \sqrt{2T} \right\rceil$. Note that the total
number of time periods after $K_0$ cycles is at least $T$ because

$$ r K_0 + \sum_{c=1}^{K_0} c \geq \sum_{c=1}^{K_0} c = \frac{K_0 (K_0 + 1)}{2} \geq \frac{K_0^2}{2} \geq T . $$

Since the cumulative regret is nondecreasing over time, it follows that

$$ \mathrm{Regret}(z, T, \mathrm{PEGE}) \leq \mathrm{Regret}\left( z, \; r K_0 + \sum_{c=1}^{K_0} c, \; \mathrm{PEGE} \right) \leq h_3 \, r \|z\| K_0 + h_4 \frac{r K_0}{\|z\|} \leq 3 \max\{h_3, h_4\} \left( \|z\| + \frac{1}{\|z\|} \right) r \sqrt{T} , $$

where the final inequality follows because $K_0 \leq 3\sqrt{T}$. The risk bound follows by taking
expectations and using the assumption on the boundedness of $E[\|Z\|]$ and $E[1/\|Z\|]$.
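
The cycle-counting step in this argument is easy to check numerically. The sketch below, with hypothetical helper names `cycles_to_cover` and `periods_after`, computes $K_0 = \lceil \sqrt{2T} \rceil$ in exact integer arithmetic and lets one confirm that $K_0$ cycles cover at least $T$ periods while $K_0 \leq 3\sqrt{T}$.

```python
import math


def cycles_to_cover(T):
    """Smallest integer K with K*K >= 2*T, i.e. K_0 = ceil(sqrt(2T))."""
    k = math.isqrt(2 * T)
    return k if k * k == 2 * T else k + 1


def periods_after(K, r):
    """Total periods in the first K PEGE cycles: r*K exploration plus 1 + ... + K exploitation."""
    return r * K + K * (K + 1) // 2
```

For example, with $T = 50$ the sketch gives $K_0 = 10$, and ten cycles with $r = 2$ contain $20 + 55 = 75 \geq 50$ periods.
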
4. A Policy for General Bandits

We have shown that when a bandit has a smooth best arm response, the PEGE policy achieves
optimal $O(r\sqrt{T})$ regret and Bayes risk. The general idea is that when the estimation error is small,
the instantaneous regret of the greedy decision based on our estimate $\widehat{Z}(c)$ can be of the same order
as $\|Z - \widehat{Z}(c)\|$. However, under the smoothness assumption, this upper bound on the instantaneous
regret is improved to $O\big( \|Z - \widehat{Z}(c)\|^2 \big)$, as shown in the proof of Lemma 3.6, and this enables us
to separate exploration from exploitation.

However, if the number of arms is finite or if the collection of arms is an arbitrary compact 
set, then the PEGE policy may not be effective. This is because a small estimation error may 
have a disproportionately large effect on the arm chosen by a greedy policy, leading to a large 
instantaneous regret. In this section, we discuss a policy - which we refer to as the Uncertainty 
Ellipsoid (UE) policy - that can be applied to any bandit problem, at the price of slightly higher 
regret and Bayes risk. In contrast to the PEGE policy, the UE policy combines active exploration 
and exploitation in every period. 

As discussed in the introduction, the UE policy is closely related to the algorithms described
in Auer (2002) and Dani et al. (2008a), but also has the "anytime" property (the policy does not
require prior knowledge of the time horizon $T$), and also allows the random vector $Z$ and the errors
$W_t^u$ to be unbounded. For the sake of completeness, we give a detailed description of our policy and
state the regret and risk bounds that we obtain. The reader can find the proofs of these bounds in
Appendix B.

To facilitate exposition, we introduce a constant that will appear in the description of the policy, 
namely, 



$$ \kappa_0 = 2 \sqrt{ 1 + \log\left( 1 + \frac{\bar{u}^2}{\lambda_0} \right) } , \qquad (3) $$

where the parameters $\bar{u}$ and $\lambda_0$ are given in Assumption 1. The UE policy maintains, at each time
period $t$, the following two pieces of information.

1. The ordinary least squares (OLS) estimate defined as follows: if $U_1, \ldots, U_t$ are the arms
chosen during the first $t$ periods, then the OLS estimate $\widehat{Z}_t$ is given by [4]

$$ C_t = \left( \sum_{s=1}^{t} U_s U_s' \right)^{-1} , \quad M_t = \sum_{s=1}^{t} U_s W_s , \quad \text{and} \quad \widehat{Z}_t = C_t \sum_{s=1}^{t} U_s X_s = Z + C_t M_t . \qquad (4) $$

[4] Let us note that we are abusing notation here. Throughout this section $\widehat{Z}_t$ stands for the OLS estimate, which is
different from the least mean squares estimator $E[Z \mid H_t]$ introduced in Section 2.

In contrast to the PEGE policy, whose estimates relied only on the rewards observed in the
exploration phases, the estimate $\widehat{Z}_t$ incorporates all available information up to time $t$. We
initialize the policy by playing $r$ linearly independent arms, so that $C_t$ is positive definite for
$t \geq r$.

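2. An uncertainty ellipsoid $\mathcal{E}_t \subseteq \mathbb{R}^r$ associated with the estimate $\widehat{Z}_t$, defined by

$$ \mathcal{E}_t = \left\{ w \in \mathbb{R}^r : w' C_t^{-1} w \leq \left( \alpha \sqrt{\log t} \, \sqrt{\min\{ r \log t , |\mathcal{U}_r| \}} \right)^2 \right\} \quad \text{and} \quad \alpha = 4 \sigma_0 \kappa_0^2 , \qquad (5) $$

where the parameters $\sigma_0$ and $\kappa_0$ are given in Assumption 1(a) and Equation (3). The uncertainty
ellipsoid $\mathcal{E}_t$ represents the set of likely "errors" associated with the estimate $\widehat{Z}_t$. We
define the uncertainty radius $R_t^u$ associated with each arm $u$ as follows:

$$ R_t^u = \max_{v \in \mathcal{E}_t} v' u = \alpha \sqrt{\log t} \, \sqrt{\min\{ r \log t , |\mathcal{U}_r| \}} \; \|u\|_{C_t} . \qquad (6) $$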

A formal description of the policy is given below.

Uncertainty Ellipsoid (UE)

Initialization: During the first $r$ periods, play the $r$ linearly independent arms $b_1, b_2, \ldots, b_r$
given in Assumption 1(b). Determine the OLS estimate $\widehat{Z}_r$, the uncertainty ellipsoid $\mathcal{E}_r$, and the
uncertainty radius associated with each arm.

Description: For $t \geq r + 1$, do the following:

(i) Let $U_t \in \mathcal{U}_r$ be an arm that gives the maximum estimated reward over the ellipsoid $\widehat{Z}_{t-1} + \mathcal{E}_{t-1}$,
that is,

$$ U_t = \arg\max_{v \in \mathcal{U}_r} \left\{ v' \widehat{Z}_{t-1} + \max_{w \in \mathcal{E}_{t-1}} w' v \right\} = \arg\max_{v \in \mathcal{U}_r} \left\{ v' \widehat{Z}_{t-1} + R_{t-1}^{v} \right\} , \qquad (7) $$

where the uncertainty radius $R_{t-1}^{v}$ is defined in Equation (6); ties are broken arbitrarily.

(ii) Play arm $U_t$ and observe the resulting reward $X_t$.

(iii) Update the OLS estimate $\widehat{Z}_t$, the uncertainty ellipsoid $\mathcal{E}_t$, and the uncertainty radius $R_t^u$ of
each arm $u$, using the formulas in Equations (4), (5), and (6).
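
To make the index in Equation (7) concrete, here is a minimal Python sketch for the special case of $r$ independent arms, where the set of arms consists of the standard basis vectors. In that case the matrix $C_t$ is diagonal with entries $1/N_k$ (the reciprocal of the number of plays of arm $k$), the OLS estimate is the vector of per-arm average rewards, and the weighted norm of each arm is $1/\sqrt{N_k}$. The function name `ue_independent_arms`, the normal errors, and the default value of `alpha` are assumptions made for illustration only; the paper's $\alpha$ is the constant appearing in Equation (5).

```python
import math
import random


def ue_independent_arms(z, T, alpha=1.0, sigma=1.0, seed=0):
    """Sketch of the UE policy when the arms are e_1..e_r; returns play counts per arm.

    Assumptions (ours): i.i.d. N(0, sigma^2) errors, and alpha is a free
    parameter standing in for the constant of the uncertainty ellipsoid.
    """
    rng = random.Random(seed)
    r = len(z)
    counts = [0] * r   # N_k: number of times arm e_k has been played
    sums = [0.0] * r   # total reward observed on arm e_k

    def play(k):
        counts[k] += 1
        sums[k] += z[k] + rng.gauss(0.0, sigma)

    # Initialization: play the r linearly independent arms once each.
    for k in range(r):
        play(k)

    for t in range(r + 1, T + 1):
        s = t - 1  # the index uses the estimate and radius from period t - 1
        # alpha * sqrt(log s) * sqrt(min{r log s, |U_r|}), with |U_r| = r here.
        scale = alpha * math.sqrt(math.log(s)) * math.sqrt(min(r * math.log(s), r))
        # Estimated reward plus uncertainty radius; ||e_k||_{C_s} = 1 / sqrt(N_k).
        index = [sums[k] / counts[k] + scale / math.sqrt(counts[k]) for k in range(r)]
        play(max(range(r), key=index.__getitem__))
    return counts
```

Because the radius shrinks like $1/\sqrt{N_k}$ and grows only logarithmically in $t$, arms with a large gap stop being selected once their radius falls below the gap.
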

By choosing an arm that maximizes the estimated reward over the ellipsoid $\widehat{Z}_t + \mathcal{E}_t$, our policy
involves simultaneous exploitation (via the term $v' \widehat{Z}_t$) and exploration (via the term $R_t^v =
\max_{w \in \mathcal{E}_t} w' v$) in every period. The ellipsoid $\mathcal{E}_t$ reflects the uncertainty in our OLS estimate $\widehat{Z}_t$.
It generalizes the classical upper confidence index introduced by Lai and Robbins (1985), to account
for correlations among the arm rewards. In the special case of $r$ independent arms where
$\mathcal{U}_r = \{e_1, \ldots, e_r\}$, it is easy to verify that for each arm $e_\ell$, the expression $e_\ell' \widehat{Z}_t + R_t^{e_\ell}$ coincides (up
to a scaling constant) with the upper confidence bound used by Auer et al. (2002). Our definition
of the uncertainty radius involves an extra factor of $\sqrt{\min\{ r \log t , |\mathcal{U}_r| \}}$, in order to handle the case
where the arms are not standard unit vectors, and the rewards are correlated.

The main results of this section are given in the following two theorems. The first theorem 
establishes upper bounds on the regret and risk when the set of arms is an arbitrary compact set. 
This result shows that the UE policy is nearly optimal, admitting upper bounds that are within a
logarithmic factor of the $\Omega(r\sqrt{T})$ lower bounds given in Theorem 2.1. Although the proof of this
theorem makes use of somewhat different (and novel) large deviation inequalities for adaptive least
squares estimators, the argument shares similarities with the proofs given in Dani et al. (2008a),
and we omit the details. The reader can find a complete proof in Appendix B.2.

Theorem 4.1 (Bounds for General Compact Sets of Arms). Under Assumption 1, there exist
positive constants $a_4$ and $a_5$ that depend only on the parameters $\sigma_0$, $\bar{u}$, and $\lambda_0$, such that for all
$T \geq r + 1$ and $z \in \mathbb{R}^r$,

$$ \mathrm{Regret}(z, T, \mathrm{UE}) \leq a_4 \, r \|z\| + a_5 \, r \sqrt{T} \log^{3/2} T , $$

and

$$ \mathrm{Risk}(T, \mathrm{UE}) \leq a_4 \, r \, E[\|Z\|] + a_5 \, r \sqrt{T} \log^{3/2} T . $$



For any arm $u \in \mathcal{U}_r$ and $z \in \mathbb{R}^r$, let $\Delta^u(z)$ denote the difference between the maximum expected
reward and the expected reward of arm $u$ when $Z = z$, that is,

$$ \Delta^u(z) = \max_{v \in \mathcal{U}_r} v' z - u' z . $$



When the number of arms is finite, it turns out that we can obtain bounds on regret and risk that
scale more gracefully over time, growing as $\log T$ and $\log^2 T$, respectively. This result is stated in
Theorem 4.2, which shows that, for a fixed set of arms, the UE policy is asymptotically optimal
as a function of time, within a constant factor of the lower bounds established by Lai and Robbins
(1985) and Lai (1987).



Theorem 4.2 (Bounds for Finitely Many Arms). Under Assumption 1, there exist positive constants
$a_6$ and $a_7$ that depend only on the parameters $\sigma_0$, $\bar{u}$, and $\lambda_0$ such that for all $T \geq r + 1$ and
$z \in \mathbb{R}^r$,

$$ \mathrm{Regret}(z, T, \mathrm{UE}) \leq a_6 \, |\mathcal{U}_r| \, \|z\| + a_7 \, |\mathcal{U}_r| \sum_{u \in \mathcal{U}_r} \min\left\{ \frac{\log T}{\Delta^u(z)} , \; T \Delta^u(z) \right\} . $$

Moreover, suppose that there exists a positive constant $M_0$ such that, for all arms $u$, the distribution
of the random variable $\Delta^u(Z)$ is described by a point mass at $0$, and a density function that is
bounded above by $M_0$ on $\mathbb{R}_+$. Then, there exist positive constants $a_8$ and $a_9$ that depend only on
the parameters $\sigma_0$, $\bar{u}$, $\lambda_0$, and $M_0$, such that for all $T \geq r + 1$,

$$ \mathrm{Risk}(T, \mathrm{UE}) \leq a_8 \, |\mathcal{U}_r| \, E[\|Z\|] + a_9 \, |\mathcal{U}_r|^2 \log^2 T . $$

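Proof. For any arm $u \in \mathcal{U}_r$ and $z \in \mathbb{R}^r$, let the random variable $N^u(z, T)$ denote the total number
of times that the arm $u$ is chosen during periods 1 through $T$, given that $Z = z$. Using an argument
similar to the one in Auer et al. (2002), we can show that

$$ E\left[ N^u(z, T) \mid Z = z \right] \leq 6 + \frac{4 \alpha^2 \, |\mathcal{U}_r| \log T}{\left( \Delta^u(z) \right)^2} . $$

The reader can find a proof of this result in Appendix B.3.

The regret bound in Theorem 4.2 then follows immediately from the above upper bound and
the fact that $N^u(z, T) \leq T$ with probability one, because

$$ \mathrm{Regret}(z, T, \mathrm{UE}) = \sum_{u \in \mathcal{U}_r} \Delta^u(z) \, E\left[ N^u(z, T) \mid Z = z \right] \leq \sum_{u \in \mathcal{U}_r} \Delta^u(z) \min\left\{ 6 + \frac{4 \alpha^2 |\mathcal{U}_r| \log T}{\left( \Delta^u(z) \right)^2} , \; T \right\} $$
$$ \leq 6 \sum_{u \in \mathcal{U}_r} \Delta^u(z) + \max\{ 4\alpha^2 , 1 \} \, |\mathcal{U}_r| \sum_{u \in \mathcal{U}_r} \min\left\{ \frac{\log T}{\Delta^u(z)} , \; T \Delta^u(z) \right\} , $$

and the desired result follows from the fact that $\Delta^u(z) = \max_{v \in \mathcal{U}_r} (v - u)' z \leq 2\bar{u} \|z\|$, by the
Cauchy-Schwarz Inequality.

We will now establish an upper bound on the Bayes risk. From the regret bound, it suffices to
show that for any $u \in \mathcal{U}_r$,

$$ E\left[ \min\left\{ \frac{\log T}{\Delta^u(Z)} , \; T \Delta^u(Z) \right\} \right] \leq (M_0 + 1) \log T + M_0 \log^2 T . $$

Let $q^u(\cdot)$ denote the density function associated with the random variable $\Delta^u(Z)$. Then,

$$ E\left[ \min\left\{ \frac{\log T}{\Delta^u(Z)} , \; T \Delta^u(Z) \right\} \right] = \int_0^{\sqrt{(\log T)/T}} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx + \int_{\sqrt{(\log T)/T}}^{1} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx + \int_1^{\infty} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx . $$

We will now proceed to bound each of the three terms on the right hand side of the above
equality. Having assumed that $q^u(\cdot) \leq M_0$, the first term satisfies

$$ \int_0^{\sqrt{(\log T)/T}} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx \leq M_0 \int_0^{\sqrt{(\log T)/T}} Tx \, dx = \frac{M_0 T}{2} \cdot \frac{\log T}{T} \leq M_0 \log T . $$

For the second term, note that

$$ \int_{\sqrt{(\log T)/T}}^{1} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx \leq M_0 \int_{\sqrt{(\log T)/T}}^{1} \frac{\log T}{x} \, dx = M_0 \log T \cdot \log x \, \Big|_{\sqrt{(\log T)/T}}^{1} = M_0 (\log T) \cdot \frac{\log T - \log \log T}{2} \leq M_0 \log^2 T , $$

where the last inequality follows from the fact that $\log T - \log \log T \leq 2 \log T$ for all $T \geq 2$. To
evaluate the last term, note that $(\log T)/x \leq \log T$ for all $x \geq 1$, and thus,

$$ \int_1^{\infty} \min\left\{ \frac{\log T}{x} , Tx \right\} q^u(x) \, dx \leq \log T \int_1^{\infty} q^u(x) \, dx \leq \log T . $$

Putting everything together, we have that

$$ E\left[ \min\left\{ \frac{\log T}{\Delta^u(Z)} , \; T \Delta^u(Z) \right\} \right] \leq (M_0 + 1) \log T + M_0 \log^2 T , $$

which is the desired result. □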

We conclude this section by giving an example of a random vector $Z$ that satisfies the condition
in Theorem 4.2. A similar example also appears in Example 2 of Lai (1987).

Example 4.3 (IID Random Variables). Suppose $\mathcal{U}_r = \{e_1, \ldots, e_r\}$ and $Z = (Z_1, \ldots, Z_r)$, where
the random variables $Z_k$ are independent and identically distributed with a common cumulative
distribution function $F$ and a density function $f : \mathbb{R} \to \mathbb{R}_+$ which is bounded above by $M$.
Then, for each $k$, the random variable $\Delta^{e_k}(Z)$ is given by $\Delta^{e_k}(Z) = \left( \max_{j=1,\ldots,r} Z_j \right) - Z_k =
\max\{ 0 , \max_{j \neq k} \{ Z_j - Z_k \} \}$. It is easy to verify that $\Delta^{e_k}(Z)$ has a point mass at $0$ and a
continuous density function on $\mathbb{R}_+$ given by: for any $x > 0$,

$$ q_k(x) = (r - 1) \int \left\{ F(z_k + x) \right\}^{r-2} f(z_k + x) \, f(z_k) \, dz_k \leq (r - 1) M . $$



4.1 Regret Bounds for Polyhedral Sets of Arms 

In this section, we focus on the regret profiles when the set of arms $\mathcal{U}_r$ is a polyhedral set. Let
$\mathcal{E}(\mathcal{U}_r)$ denote the set of extreme points of $\mathcal{U}_r$. From a standard result in linear programming, for
all $z \in \mathbb{R}^r$,

$$ \max_{u \in \mathcal{U}_r} u' z = \max_{u \in \mathcal{E}(\mathcal{U}_r)} u' z . $$

Since a polyhedral set has a finite number of extreme points ($|\mathcal{E}(\mathcal{U}_r)| < \infty$), the parameterized
bandit problem can be reduced to the standard multi-armed bandit problem, where each arm
corresponds to an extreme point of $\mathcal{U}_r$. We can thus apply the algorithm of Lai and Robbins
(1985) and obtain the following upper bound on the $T$-period cumulative regret for polyhedra:

$$ \mathrm{Regret}(z, T, \text{Lai's Algorithm}) = O\left( \frac{ |\mathcal{E}(\mathcal{U}_r)| \cdot \log T }{ \min\{ \Delta^u(z) : \Delta^u(z) > 0 \} } \right) , \qquad (8) $$

where the denominator corresponds to the difference between the expected reward of the optimal
and the second best extreme points. The algorithm of Lai and Robbins (1985) is effective only when
the polyhedral set $\mathcal{U}_r$ has a small number of extreme points, as shown by the following examples.

Example 4.4 (Simplex). Suppose $\mathcal{U}_r = \{ u \in \mathbb{R}^r : \sum_{i=1}^{r} |u_i| \leq 1 \}$ is an $r$-dimensional unit simplex.
Then, $\mathcal{U}_r$ has $2r$ extreme points, and Equation (8) gives an $O(r \log T)$ upper bound on the regret.

Example 4.5 (Linear Constraints). Suppose that $\mathcal{U}_r = \{ u \in \mathbb{R}^r : Au \leq b \text{ and } u \geq 0 \}$, where $A$
is a $p \times r$ matrix with $p < r$. It follows from the standard linear programming theory that
every extreme point is a basic feasible solution, which has at most $p$ nonzero coordinates (see, for
example, Bertsimas and Tsitsiklis 1997). Thus, the number of extreme points is bounded above
by $\binom{r+p}{p} = O((2r)^p)$, and Equation (8) gives an $O((2r)^p \log T)$ upper bound on the regret.
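
The counts in these examples are simple to tabulate. The sketch below, with the hypothetical helper name `extreme_point_bound`, evaluates the binomial bound of Example 4.5 and compares it with $(2r)^p$; it is only an illustration of how the number of extreme points, and hence the bound in Equation (8), grows with $r$ and $p$.

```python
import math


def extreme_point_bound(r, p):
    """Upper bound C(r + p, p) on the number of basic feasible solutions in Example 4.5."""
    return math.comb(r + p, p)
```

For instance, `extreme_point_bound(10, 3)` evaluates to 286, well below $(2 \cdot 10)^3 = 8000$, whereas an $r$-dimensional cube with $r = 10$ already has $2^{10} = 1024$ extreme points.
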

In general, the number of extreme points of a polyhedron can be very large, rendering the
bandit algorithm of Lai and Robbins (1985) ineffective; consider, for example, the $r$-dimensional
cube $\mathcal{U}_r = \{ u \in \mathbb{R}^r : |u_i| \leq 1 \text{ for all } i \}$, which has $2^r$ extreme points. Moreover, we cannot apply
the results and algorithms from Section 3 to the convex hull of $\mathcal{U}_r$. This is because the convex
hull of a polyhedron is not strongly convex (it cannot be written as an intersection of Euclidean
balls), and thus, it does not satisfy the required SBAR($\cdot$) condition in Theorem 3.1. The UE policy
in the previous section gives $O(r\sqrt{T} \log^{3/2} T)$ regret and risk upper bounds. However, finding an
algorithm specifically for polyhedral sets that yields an $O(r\sqrt{T})$ regret upper bound (without an
additional logarithmic factor) remains an open question.

5. Conclusion 

We analyzed a class of multiarmed bandit problems where the expected reward of each arm depends 
linearly on an unobserved random vector $Z \in \mathbb{R}^r$, with $r \geq 2$. Our model allows for correlations
among the rewards of different arms. When we have a smooth best arm response, we showed that 
a policy that alternates between exploration and exploitation is optimal. For a general bandit, we 
proposed a near-optimal policy that performs active exploration in every period. For finitely many 
arms, our policy achieves asymptotically optimal regret and risk as a function of time, but scales 
with the square of the number of arms. Improving the dependence on the number of arms remains 
an open question. It would also be interesting to study more general correlation structures. Our 
formulation assumes that the vector of expected rewards lies in an r-dimensional subspace spanned 
by a known set of basis functions that describe the characteristics of the arms. Extending our work 
to a setting where the basis functions are unknown has the potential to broaden the applicability 
of our model.

A. Properties of Normal Vectors

In this section, we prove that if $Z$ has a multivariate normal distribution with mean $0 \in \mathbb{R}^r$ and
covariance matrix $I_r / r$, then $Z$ has the properties described in Lemmas 2.4 and 3.2.

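A.1 Proof of Lemma 2.4

We want to establish a lower bound on $\Pr\{ \theta \leq \|Z\| \leq \beta \}$. Let $Y = (Y_1, \ldots, Y_r)$ denote the
standard multivariate normal random vector with mean $0$ and identity covariance matrix $I_r$. By
our hypothesis, $Z$ has the same distribution as $Y/\sqrt{r}$, which implies that

$$ \Pr\{ \theta \leq \|Z\| \leq \beta \} = \Pr\left\{ \theta\sqrt{r} \leq \|Y\| \leq \beta\sqrt{r} \right\} = 1 - \Pr\left\{ \|Y\|^2 < \theta^2 r \right\} - \Pr\left\{ \|Y\|^2 > \beta^2 r \right\} . $$

By definition, $\|Y\|^2 = Y_1^2 + \cdots + Y_r^2$ has a chi-square distribution with $r$ degrees of freedom. By
the Markov Inequality, $\Pr\left\{ \|Y\|^2 > \beta^2 r \right\} \leq E\left[ \|Y\|^2 \right] / (\beta^2 r) = 1/\beta^2$. We will now establish an
upper bound on $\Pr\left\{ \|Y\|^2 < \theta^2 r \right\}$. Note that, for any $\lambda > 0$,

$$ \Pr\left\{ \|Y\|^2 < \theta^2 r \right\} = \Pr\left\{ e^{-\lambda \sum_{k=1}^{r} Y_k^2} > e^{-\lambda \theta^2 r} \right\} \leq e^{\lambda \theta^2 r} \cdot E\left[ \prod_{k=1}^{r} e^{-\lambda Y_k^2} \right] = \left( \frac{ e^{\lambda \theta^2} }{ \sqrt{1 + 2\lambda} } \right)^{r} , $$

where the last equality follows from the fact that $Y_1, \ldots, Y_r$ are independent standard normal random
variables and thus, $E\left[ e^{-\lambda Y_k^2} \right] = 1/\sqrt{1 + 2\lambda}$ for $\lambda > 0$. Set $\lambda = 1/\theta^2$, and use the facts $\theta \leq 1/2 \leq
\sqrt{2}/e$ and $r \geq 2$, to obtain

$$ \Pr\left\{ \|Y\|^2 < \theta^2 r \right\} \leq \left( \frac{ e \theta }{ \sqrt{2 + \theta^2} } \right)^{r} \leq \left( \frac{ e \theta }{ \sqrt{2} } \right)^{r} \leq \left( \frac{ e \theta }{ \sqrt{2} } \right)^{2} = \frac{ e^2 \theta^2 }{ 2 } \leq 4 \theta^2 , $$

which implies that $\Pr\{ \theta \leq \|Z\| \leq \beta \} \geq 1 - \frac{1}{\beta^2} - 4\theta^2$, which is the desired result.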
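
The bound just derived can be compared against a simulation. The following sketch estimates $\Pr\{\theta \leq \|Z\| \leq \beta\}$ by Monte Carlo for $Z$ normal with covariance $I_r/r$ and evaluates the lower bound $1 - 1/\beta^2 - 4\theta^2$. The function names, sample size, and seed are our own choices for illustration, and the estimate is subject to sampling error.

```python
import random


def shell_probability(r, theta, beta, num_samples=5000, seed=0):
    """Monte Carlo estimate of Pr{theta <= ||Z|| <= beta} for Z ~ N(0, I_r / r)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # ||Z||^2 has the same distribution as ||Y||^2 / r with Y standard normal.
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(r)) / r
        if theta * theta <= squared_norm <= beta * beta:
            hits += 1
    return hits / num_samples


def shell_lower_bound(theta, beta):
    """The lower bound 1 - 1/beta^2 - 4*theta^2 (stated for theta <= 1/2)."""
    return 1.0 - 1.0 / beta ** 2 - 4.0 * theta ** 2
```

With $\theta = 1/4$ and $\beta = 2$ the lower bound equals $1/2$, and the simulated probability is typically far above it, which reflects that the Markov and Chernoff steps in the proof are not tight.
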
A.2 Proof of Lemma 3.2

For part (a) of the lemma, we have

$$ E\left[ 1/\|Z\| \right] = \int_0^{\infty} \frac{1}{x} \, g(x) \, dx \leq M_0 \int_0^{\rho} x^{\rho - 1} \, dx + \frac{1}{\rho} \int_{\rho}^{\infty} g(x) \, dx \leq M_0 \frac{\rho^{\rho}}{\rho} + \frac{1}{\rho} . $$

For the proof of part (b), let $Y = (Y_1, \ldots, Y_r)$ be a standard multivariate normal random vector
with mean $0$ and identity covariance matrix $I_r$. Then, $Z$ has the same distribution as $Y/\sqrt{r}$. Note
that $\|Y\|^2$ has a chi-square distribution with $r$ degrees of freedom. Thus,

$$ E[\|Z\|] = \frac{1}{\sqrt{r}} E[\|Y\|] \leq \frac{1}{\sqrt{r}} \sqrt{ E\left[ \|Y\|^2 \right] } = \frac{1}{\sqrt{r}} \sqrt{r} = 1 . $$

We will now establish an upper bound on $E[1/\|Z\|] = \sqrt{r} \, E[1/\|Y\|]$. For $r = 2$, since $\|Y\|$
has a chi distribution with two degrees of freedom, we have that

$$ E[1/\|Z\|] = \sqrt{2} \int_0^{\infty} \frac{1}{x} \cdot x e^{-x^2/2} \, dx = \sqrt{2} \int_0^{\infty} e^{-x^2/2} \, dx = \sqrt{\pi} . $$

Consider the case where $r \geq 3$. Then,

$$ E[1/\|Z\|] = \sqrt{r} \, E[1/\|Y\|] \leq \sqrt{r} \sqrt{ E\left[ 1/\|Y\|^2 \right] } . $$

Using the formula for the density of the chi-square distribution, we have

$$ E\left[ 1/\|Y\|^2 \right] = \int_0^{\infty} \frac{1}{x} \cdot \frac{1}{2^{r/2} \, \Gamma(r/2)} \, x^{(r/2) - 1} e^{-x/2} \, dx = \frac{ \Gamma((r/2) - 1) }{ 2 \, \Gamma(r/2) } \int_0^{\infty} \frac{1}{ 2^{(r-2)/2} \, \Gamma((r-2)/2) } \, x^{((r-2)/2) - 1} e^{-x/2} \, dx $$
$$ = \frac{1}{2((r/2) - 1)} = \frac{1}{r - 2} \leq \frac{3}{r} , $$

where the third equality follows from the fact that $\Gamma(r/2) = ((r/2) - 1) \cdot \Gamma((r/2) - 1)$ for $r \geq 3$ and
the integrand is the density function of the chi-square distribution with $r - 2$ degrees of freedom and
integrates to 1. The last inequality follows because $r \geq 3$. Thus, we have $E[1/\|Z\|] \leq \sqrt{3} \leq \sqrt{\pi}$,
which is the desired result.
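
The bounds in part (b) can also be checked against a closed form. The sketch below relies on the standard moment formula for the chi distribution, $E[1/\|Y\|] = \Gamma((r-1)/2) / (\sqrt{2}\, \Gamma(r/2))$ for $r \geq 2$, which is an outside fact rather than something derived in this appendix; the function name `expected_inverse_norm` is ours.

```python
import math


def expected_inverse_norm(r):
    """E[1/||Z||] for Z ~ N(0, I_r / r) with r >= 2, using the chi-distribution moment formula."""
    # E[1/||Z||] = sqrt(r) * E[1/||Y||] = sqrt(r/2) * Gamma((r-1)/2) / Gamma(r/2).
    log_ratio = math.lgamma((r - 1) / 2) - math.lgamma(r / 2)
    return math.sqrt(r / 2) * math.exp(log_ratio)
```

For $r = 2$ the formula returns $\sqrt{\pi}$, matching the computation above, and for $r \geq 3$ the values stay below $\sqrt{3}$, consistent with the bound obtained from the second moment.
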
B. Proof of Theorems 4.1 and 4.2

In the next section, we establish large deviation inequalities for adaptive least squares estimators
(with unbounded error random variables), which will be used in the proofs of Theorems 4.1 and 4.2
given in Sections B.2 and B.3, respectively.

B.1 Large Deviation Inequalities

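The first result extends the standard Chernoff Inequality to our setting involving uncertainty
ellipsoids when we have finitely many arms.

Theorem B.1 (Chernoff Inequality for Uncertainty Ellipsoids with Finitely Many Arms). Under
Assumption 1, for any $t \geq r$, $x \in \mathbb{R}^r$, $z \in \mathbb{R}^r$, and $\zeta > 0$,

$$ \Pr\left\{ x' \left( \widehat{Z}_t - z \right) > \zeta \sigma_0 \|x\|_{C_t} \,\Big|\, Z = z \right\} \leq t^{5 |\mathcal{U}_r|} e^{-\zeta^2 / 2} , $$

and

$$ \Pr\left\{ (U_{t+1} - x)' \left( \widehat{Z}_t - z \right) > \zeta \sigma_0 \|U_{t+1} - x\|_{C_t} \,\Big|\, Z = z \right\} \leq t^{5 |\mathcal{U}_r|} e^{-\zeta^2 / 2} . $$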



Proof. We will only prove the second inequality because the proof of the first one follows the same
argument. If the sequence of arms $U_1, U_2, \ldots$ is deterministic (and thus, the matrix $C_t$ is also
deterministic), then

$$ \frac{ (U_{t+1} - x)' \left( \widehat{Z}_t - z \right) }{ \|U_{t+1} - x\|_{C_t} } = \sum_{s=1}^{t} \frac{ (U_{t+1} - x)' C_t U_s }{ \|U_{t+1} - x\|_{C_t} } \, W_s , $$

and

$$ \sum_{s=1}^{t} \left( \frac{ (U_{t+1} - x)' C_t U_s }{ \|U_{t+1} - x\|_{C_t} } \right)^2 = \frac{ (U_{t+1} - x)' C_t \left( \sum_{s=1}^{t} U_s U_s' \right) C_t (U_{t+1} - x) }{ (U_{t+1} - x)' C_t (U_{t+1} - x) } = 1 . $$

The classical Chernoff Inequality for the sum of independent random variables (see, for example,
Chapter 1 in Dudley 1999) then yields

$$ \Pr\left\{ (U_{t+1} - x)' \left( \widehat{Z}_t - z \right) > \zeta \sigma_0 \|U_{t+1} - x\|_{C_t} \right\} \leq \exp\left\{ - \frac{ \zeta^2 \sigma_0^2 }{ 2 \sigma_0^2 \sum_{s=1}^{t} \left( (U_{t+1} - x)' C_t U_s / \|U_{t+1} - x\|_{C_t} \right)^2 } \right\} = e^{-\zeta^2 / 2} . $$

In our setting, however, the arms $U_s$ are random variables that depend on the accumulated history,
and we cannot apply the standard Chernoff inequality directly. If $N^u(z, t)$ denotes the total number
of times that arm $u$ has been chosen during the first $t$ periods given that $Z = z$, then

$$ C_t = \left( \sum_{s=1}^{t} U_s U_s' \right)^{-1} = \left( \sum_{u \in \mathcal{U}_r} N^u(z, t) \, u u' \right)^{-1} , $$

which shows that the matrix $C_t$ is completely determined by the nonnegative integer random
variables $N^u(z, t)$. Since $0 \leq N^u(z, t) \leq t$, the number of possible values of the vector $(N^u(z, t) : u \in \mathcal{U}_r)$
is at most $t^{|\mathcal{U}_r|}$. It then follows easily that the number of different values of the ordered pair
$(U_{t+1}, C_t)$ is at most $|\mathcal{U}_r| \, t^{|\mathcal{U}_r|} \leq t^{5 |\mathcal{U}_r|}$. To get the desired result, we can then use the union bound
and apply the classical Chernoff Inequality to each ordered pair. □



When the number of arms is infinite, the bounds in Theorem B.1 are vacuous. The following
theorem provides an extension of the Chernoff inequality to the case of infinitely many arms.

Theorem B.2 (Chernoff Inequality for Uncertainty Ellipsoids with Infinitely Many Arms). Under
Assumption 1, for any $t \geq r$, $x \in \mathbb{R}^r$, $z \in \mathbb{R}^r$, and $\zeta \geq 2$,

$$ \Pr\left\{ x' \left( \widehat{Z}_t - z \right) > \zeta \kappa_0 \sigma_0 \sqrt{\log t} \; \|x\|_{C_t} \,\Big|\, Z = z \right\} \leq t^{r \kappa_0^2} e^{-\zeta^2 / 4} , $$

and

$$ \Pr\left\{ (U_{t+1} - x)' \left( \widehat{Z}_t - z \right) > \zeta \kappa_0 \sigma_0 \sqrt{\log t} \; \|U_{t+1} - x\|_{C_t} \,\Big|\, Z = z \right\} \leq t^{r \kappa_0^2} e^{-\zeta^2 / 4} . $$

The proof of Theorem B.2 makes use of the following series of lemmas. The first lemma
establishes a tail inequality for a ratio of two random variables. De la Peña et al. (2004) gave a
proof of this result in Corollary 2.2 (page 1908) of their paper.

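Lemma B.3 (Exponential Inequality for Ratios, De la Peña et al. 2004). Let $A$ and $B$ be two
random variables such that $B \geq 0$ with probability one and $E\left[ e^{\gamma A - (\gamma^2 B^2 / 2)} \right] \leq 1$ for all $\gamma \in \mathbb{R}$.
Then, for all $\zeta \geq \sqrt{2}$ and $y > 0$,

$$ \Pr\left\{ |A| \geq \zeta \sqrt{ \left( B^2 + y \right) \left( 1 + \frac{1}{2} \log\left( 1 + \frac{B^2}{y} \right) \right) } \right\} \leq e^{-\zeta^2 / 2} . $$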

Recall from Equation (4) that $M_t = \sum_{s=1}^{t} U_s W_s$ is the martingale associated with the least
squares estimate $\widehat{Z}_t$. The next lemma establishes a martingale inequality associated with the
inner product $x' M_t$ for an arbitrary vector $x \in \mathbb{R}^r$. This result is based on Lemma B.3 with
$A = x' M_t / \sigma_0 = \sum_{s=1}^{t} (x' U_s) W_s / \sigma_0$ and $B = \|x\|_{C_t^{-1}} = \sqrt{x' C_t^{-1} x} = \sqrt{ \sum_{s=1}^{t} (x' U_s)^2 }$. We then use upper
and lower bounds on $B$ to establish bounds on the term $\log\left( 1 + \frac{B^2}{y} \right)$, for a suitable choice of $y$,
giving us the desired result.

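Lemma B.4 (Martingale Inequality). Under Assumption 1, for any $x \in \mathbb{R}^r$, $t \ge 1$, and $\zeta \ge \sqrt{2}$,

$$\Pr\left\{ |x'M_t| > \zeta \kappa_0 \sigma_0 \sqrt{\log t}\, \|x\|_{C_t^{-1}} \right\} = \Pr\left\{ x'M_t M_t' x > \zeta^2 \kappa_0^2 \sigma_0^2 (\log t) \big(x'C_t^{-1}x\big) \right\} \le e^{-\zeta^2/2}.$$

Proof. Let $x \in \mathbb{R}^r$ and $t \ge 1$ be given. Without loss of generality, we can assume that $\|x\| = 1$.
Let the random variables $A$ and $B$ be defined by

$$A = \frac{x'M_t}{\sigma_0} \qquad \text{and} \qquad B = \sqrt{\sum_{s=1}^{t} (x'U_s)^2}.$$

For any $s$, let $H_s = (U_1, X_1, W_1, \ldots, U_s, X_s, W_s)$ denote the history until the end of period $s$. By definition,
$U_s$ is a function of $H_{s-1}$, and it follows from Assumption 1(a) that for any $\gamma \in \mathbb{R}$,

$$E\left[ e^{\frac{\gamma}{\sigma_0}(x'U_s) W_s - \frac{\gamma^2 (x'U_s)^2}{2}} \,\Big|\, H_{s-1} \right] = e^{-\frac{\gamma^2 (x'U_s)^2}{2}}\, E\left[ e^{\frac{\gamma}{\sigma_0}(x'U_s) W_s} \,\Big|\, H_{s-1} \right] \le 1.$$

Using a standard argument involving iterated expectations, we obtain

$$E\left[ e^{\gamma A - (\gamma^2 B^2/2)} \right] = E\left[ \prod_{s=1}^{t} e^{\frac{\gamma}{\sigma_0}(x'U_s) W_s - \frac{\gamma^2 (x'U_s)^2}{2}} \right] \le 1.$$

We can thus apply Lemma B.3 to the random variables $A$ and $B$. Moreover, it follows from the
definition of $\bar{u}$ and $\lambda_0$ in Assumption 1(b) that, with probability one,

$$\lambda_0 \le \lambda_{\min}\left( \sum_{s=1}^{t} U_s U_s' \right) \le x'\left( \sum_{s=1}^{t} U_s U_s' \right) x = B^2 = \sum_{s=1}^{t} (x'U_s)^2 \le t\bar{u}^2.$$

Therefore, $B^2 + \lambda_0 \le 2B^2$, and

$$1 + \frac{1}{2}\log\left(1 + \frac{B^2}{\lambda_0}\right) \le 1 + \frac{1}{2}\log\left(1 + \frac{t\bar{u}^2}{\lambda_0}\right) \le 1 + \frac{1}{2}\log t + \frac{1}{2}\log\left(1 + \frac{\bar{u}^2}{\lambda_0}\right) = \frac{\log t}{2}\left(1 + \frac{2 + \log\left(1 + (\bar{u}^2/\lambda_0)\right)}{\log t}\right) \le \frac{\kappa_0^2 \log t}{2},$$

where the last inequality follows from the definition of $\kappa_0$ and the fact that $t \ge r \ge 2$. These two
upper bounds imply that $\sqrt{(B^2 + \lambda_0)\left(1 + \tfrac{1}{2}\log\left(1 + B^2/\lambda_0\right)\right)} \le \kappa_0 \sqrt{\log t}\, B$. Therefore,

$$\Pr\left\{ |x'M_t| > \zeta \kappa_0 \sigma_0 \sqrt{\log t}\, \|x\|_{C_t^{-1}} \right\} \le \Pr\left\{ |A| > \zeta \sqrt{(B^2 + \lambda_0)\left(1 + \tfrac{1}{2}\log\left(1 + \frac{B^2}{\lambda_0}\right)\right)} \right\},$$

and the desired result then follows immediately from Lemma B.3.

□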
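
As a rough numerical illustration of the tail bound in Lemma B.3 (it plays no role in the proof), the sketch below simulates $A = \sum_s x_s W_s$ with fixed weights $x_s$ and independent standard normal noise, so that $\sigma_0 = 1$ and $B^2 = \sum_s x_s^2$. The helper name `tail_frequency`, the Gaussian noise model, and every numerical value are assumptions made for this example only.

```python
import math
import random


def tail_frequency(weights, zeta, y, trials=20000, seed=0):
    """Empirical frequency of the event
    |A| >= zeta * sqrt((B^2 + y) * (1 + 0.5 * log(1 + B^2 / y))).

    A = sum_s weights[s] * W_s with W_s i.i.d. standard normal (an assumption
    made for illustration), and B^2 = sum_s weights[s]^2.
    """
    rng = random.Random(seed)
    b_sq = sum(w * w for w in weights)
    threshold = zeta * math.sqrt((b_sq + y) * (1.0 + 0.5 * math.log(1.0 + b_sq / y)))
    hits = 0
    for _ in range(trials):
        a = sum(w * rng.gauss(0.0, 1.0) for w in weights)
        if abs(a) >= threshold:
            hits += 1
    return hits / trials


weights = [0.3, -0.8, 0.5, 1.0, -0.2]    # hypothetical values of x'U_s
zeta = math.sqrt(2.0)                     # smallest zeta allowed by the lemma
freq = tail_frequency(weights, zeta, y=1.0)
bound = math.exp(-zeta ** 2 / 2.0)        # right-hand side e^{-zeta^2 / 2}
print(freq, bound)
```

In this Gaussian setting the empirical frequency should fall well below the bound, which is expected because the lemma covers a much wider class of pairs $(A, B)$.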



The next and final lemma extends the previous result to show that the matrix $\zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, C_t^{-1} - M_t M_t'$
is positive semidefinite with a high probability. The proof of this result makes use of the
fact that for the matrix $\zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, C_t^{-1} - M_t M_t'$ to be positive semidefinite, it suffices for the
inequality $x'M_t M_t' x \le \zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, x'C_t^{-1}x$ to hold for vectors $x$ in a sufficiently dense subset.
We can then apply Lemma B.4 for each such vector $x$ and use the union bound.

Lemma B.5. Under Assumption 1, for any $t \ge r$ and $\zeta \ge 2$,

$$\Pr\left\{ M_t M_t' \le \zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, C_t^{-1} \right\} \ge 1 - t^{r\kappa_0^2}\, e^{-\zeta^2/4}.$$
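Proof. Let $S^r = \{x \in \mathbb{R}^r : \|x\| = 1\}$ denote the unit sphere in $\mathbb{R}^r$. Let $\delta > 0$ be defined by:

$$\delta = \frac{\lambda_0}{9\bar{u}^2 t},$$

where the constants $\lambda_0$ and $\bar{u}$ are given in Assumption 1(b). Without loss of generality, we can
assume that $\delta \le 1/2$ and that $1/\delta$ is an integer. Let $\mathcal{X}^r$ be a covering of $S^r$, that is, for any $x \in S^r$,
there exists $y \in \mathcal{X}^r$ such that $\|x - y\| \le \delta$. It is easy to verify that $\mathcal{X}^r$ can be chosen to have a
cardinality of at most $(2\sqrt{r}/\delta)^r$ because we can consider a rectangular grid on $[-1, 1]^r$ with a grid
spacing of $\delta/\sqrt{r}$. Then, for any point $x \in S^r$, there is a point $y$ on the rectangular grid such that
the magnitude of each component of $x - y$ is at most $\delta/\sqrt{r}$, which implies that $\|x - y\| \le \delta$.

Let $t \ge r$ and $\zeta \ge 2$ be given. To facilitate our exposition, let $\beta = \zeta^2 \kappa_0^2 \sigma_0^2 \log t$. Let $G$ denote
the event that the following inequalities hold:

$$e_i' M_t M_t' e_i \le \beta\, e_i' C_t^{-1} e_i, \quad i = 1, 2, \ldots, r, \qquad \text{and} \qquad y' M_t M_t' y \le \frac{\beta}{2}\, y' C_t^{-1} y, \quad \forall\, y \in \mathcal{X}^r.$$

Using the union bound, it follows from Lemma B.4 (applied with $\zeta$ for the vectors $e_i$ and with $\zeta/\sqrt{2} \ge \sqrt{2}$ for the vectors $y$) that the event $G$ happens with a probability at
least

$$1 - |\mathcal{X}^r|\, e^{-\zeta^2/4} - r e^{-\zeta^2/2} \ge 1 - \left(\frac{2\sqrt{r}}{\delta}\right)^r e^{-\zeta^2/4} - r e^{-\zeta^2/2} \ge 1 - \left( \left(\frac{2\sqrt{r}}{\delta}\right)^r + r \right) e^{-\zeta^2/4}$$

$$\ge 1 - \left(\frac{4\sqrt{r}}{\delta}\right)^r e^{-\zeta^2/4} \ge 1 - \left(\frac{36\bar{u}^2 t^2}{\lambda_0}\right)^r e^{-\zeta^2/4} \ge 1 - t^{r\kappa_0^2}\, e^{-\zeta^2/4},$$

where we have used the fact that $t \ge \sqrt{r}$ in the penultimate inequality. The final inequality
follows from the definition of $\kappa_0$ in Equation (3), which implies that $\kappa_0^2 \ge 4\left(1 + \log(36\bar{u}^2/\lambda_0)\right) \ge 4$,
and thus, $\frac{36\bar{u}^2 t^2}{\lambda_0} \le t^2 e^{\kappa_0^2/4} \le t^{\kappa_0^2/2}\, (t^2)^{\kappa_0^2/4} = t^{\kappa_0^2}$.

To complete the proof, it suffices to show that when the event $G$ occurs, we have that $x' M_t M_t' x \le \beta\, x' C_t^{-1} x$
for all $x \in S^r$. Consider an arbitrary $x \in S^r$, and let $y \in \mathcal{X}^r$ be such that $\|x - y\| \le \delta$.
This implies that $\|x + y\| \le \|2x\| + \|y - x\| \le 2 + \delta \le 3$. Moreover, $x' M_t M_t' x - y' M_t M_t' y = (x - y)' M_t M_t' (x + y) \le 3\delta \|M_t\|^2$, where we use the Cauchy-Schwarz Inequality for the last inequality.

Similarly, we can show that for all $s$, $y' U_s U_s' y \le x' U_s U_s' x + 3\delta\bar{u}^2$. Summing over all $s$, we
obtain that $y' C_t^{-1} y \le x' C_t^{-1} x + 3\delta t \bar{u}^2$. Putting everything together, we have that

$$x' M_t M_t' x \le y' M_t M_t' y + 3\delta \|M_t\|^2 \le \frac{\beta}{2}\, y' C_t^{-1} y + 3\delta \|M_t\|^2 \le \frac{\beta}{2}\, x' C_t^{-1} x + \frac{3}{2}\beta\delta t\bar{u}^2 + 3\delta \|M_t\|^2$$

$$\le \beta\, x' C_t^{-1} x - \frac{\beta}{2}\lambda_0 + \frac{3}{2}\beta\delta t\bar{u}^2 + 3\delta \|M_t\|^2,$$

where the last inequality follows from the fact that $C_t^{-1} = \sum_{s=1}^{t} U_s U_s' \ge \lambda_0 I_r$ from our definition
of $\lambda_0$. Finally, note that under the event $G$,

$$\|M_t\|^2 = \sum_{i=1}^{r} e_i' M_t M_t' e_i \le \beta \sum_{i=1}^{r} e_i' C_t^{-1} e_i = \beta \sum_{s=1}^{t} \sum_{i=1}^{r} (e_i' U_s)^2 = \beta \sum_{s=1}^{t} \|U_s\|^2 \le \beta t \bar{u}^2,$$

which implies that

$$-\frac{\beta}{2}\lambda_0 + \frac{3}{2}\beta\delta t\bar{u}^2 + 3\delta \|M_t\|^2 \le -\frac{\beta}{2}\lambda_0 + \frac{9}{2}\beta\delta t\bar{u}^2 = \frac{\beta}{2}\left(9\delta t\bar{u}^2 - \lambda_0\right) = 0,$$

where the last equality follows from the definition of $\delta$. Thus, we have that $x' M_t M_t' x \le \beta\, x' C_t^{-1} x$,
which is the desired result.

□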
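
The covering step in the proof above can be checked numerically. The following sketch, which assumes a specific dimension $r = 4$ and $\delta = 0.25$ chosen only for illustration, rounds random points of the unit sphere to a rectangular grid with spacing $\delta/\sqrt{r}$ and records the largest Euclidean distance between a point and its grid neighbor. The function names are hypothetical.

```python
import math
import random


def nearest_grid_point(x, delta):
    """Round each coordinate of x to a grid with spacing delta / sqrt(r)."""
    r = len(x)
    spacing = delta / math.sqrt(r)
    return [round(xi / spacing) * spacing for xi in x]


def random_unit_vector(r, rng):
    """Draw a point on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(r)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


rng = random.Random(1)
r, delta = 4, 0.25            # illustrative values; 1/delta is an integer
worst = 0.0
for _ in range(1000):
    x = random_unit_vector(r, rng)
    y = nearest_grid_point(x, delta)
    dist = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    worst = max(worst, dist)
print(worst, worst <= delta)
```

Rounding to the nearest grid point moves each coordinate by at most half the grid spacing, so the observed distance is in fact at most $\delta/2$, comfortably within the $\delta$ required of the covering.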



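We are now ready to give a proof of Theorem B.2.

Proof. It suffices to establish the second inequality in Theorem B.2 because the proof for the first
inequality follows the same argument. It follows from the Cauchy-Schwarz inequality that

$$(U_{t+1} - x)'\big(\widehat{Z}_t - z\big) = (U_{t+1} - x)'\, C_t^{1/2} C_t^{-1/2} \big(\widehat{Z}_t - z\big) \le \left\| C_t^{1/2}(U_{t+1} - x) \right\| \left\| C_t^{-1/2}\big(\widehat{Z}_t - z\big) \right\| = \|U_{t+1} - x\|_{C_t} \left\| C_t^{-1/2}\big(\widehat{Z}_t - z\big) \right\|$$

with probability one. Therefore,

$$\Pr\left\{ (U_{t+1} - x)'\big(\widehat{Z}_t - z\big) > \zeta \kappa_0 \sigma_0 \sqrt{\log t}\, \|U_{t+1} - x\|_{C_t} \mid Z = z \right\}$$

$$\le \Pr\left\{ \left\| C_t^{-1/2}\big(\widehat{Z}_t - z\big) \right\| > \zeta \kappa_0 \sigma_0 \sqrt{\log t} \mid Z = z \right\}$$

$$= \Pr\left\{ \big(\widehat{Z}_t - z\big)' C_t^{-1} \big(\widehat{Z}_t - z\big) > \zeta^2 \kappa_0^2 \sigma_0^2 \log t \mid Z = z \right\}$$

$$= \Pr\left\{ M_t' C_t M_t > \zeta^2 \kappa_0^2 \sigma_0^2 \log t \mid Z = z \right\},$$

where the last equality follows from the definition of the least squares estimate $\widehat{Z}_t$.
It is a well-known result in linear algebra (see, for example, Theorem 1.3.3 in Bhatia, 2007)
that if $A$ and $B$ are two symmetric positive definite matrices, then the block matrix

$$\begin{pmatrix} A & X \\ X' & B \end{pmatrix}$$

is positive semidefinite if and only if $X B^{-1} X' \le A$. Applying this result to the two "equivalent"
$(r + 1) \times (r + 1)$ matrices

$$\begin{pmatrix} \zeta^2 \kappa_0^2 \sigma_0^2 \log t & M_t' \\ M_t & C_t^{-1} \end{pmatrix} \qquad \text{and} \qquad \begin{pmatrix} C_t^{-1} & M_t \\ M_t' & \zeta^2 \kappa_0^2 \sigma_0^2 \log t \end{pmatrix},$$

we conclude
that $M_t' C_t M_t \le \zeta^2 \kappa_0^2 \sigma_0^2 \log t$ if and only if $M_t M_t' \le \zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, C_t^{-1}$. The desired result then
follows from the fact that

$$\Pr\left\{ M_t' C_t M_t > \zeta^2 \kappa_0^2 \sigma_0^2 \log t \mid Z = z \right\} = 1 - \Pr\left\{ M_t' C_t M_t \le \zeta^2 \kappa_0^2 \sigma_0^2 \log t \mid Z = z \right\}$$

$$= 1 - \Pr\left\{ M_t M_t' \le \zeta^2 \kappa_0^2 \sigma_0^2 (\log t)\, C_t^{-1} \mid Z = z \right\} \le t^{r\kappa_0^2}\, e^{-\zeta^2/4},$$

where the last inequality follows from Lemma B.5. □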
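
The block-matrix equivalence used in the proof, namely that $m'Cm \le c$ holds exactly when $c\,C^{-1} - mm'$ is positive semidefinite, can be sanity-checked in dimension two. The sketch below is only an illustration under that assumption ($r = 2$, randomly generated positive definite $C$); all function names are hypothetical, and near-ties are skipped to avoid floating-point ambiguity.

```python
import random


def quad_form(C, m):
    """Return m' C m for a 2x2 matrix C and a 2-vector m."""
    return (C[0][0] * m[0] * m[0] + (C[0][1] + C[1][0]) * m[0] * m[1]
            + C[1][1] * m[1] * m[1])


def inverse_2x2(C):
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [[C[1][1] / det, -C[0][1] / det], [-C[1][0] / det, C[0][0] / det]]


def is_psd_2x2(S, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are nonnegative."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return S[0][0] >= -tol and S[1][1] >= -tol and det >= -tol


def both_sides_agree(C, m, c):
    """Compare the scalar condition m' C m <= c with PSD-ness of c * C^{-1} - m m'."""
    scalar_side = quad_form(C, m) <= c
    Cinv = inverse_2x2(C)
    S = [[c * Cinv[i][j] - m[i] * m[j] for j in range(2)] for i in range(2)]
    return scalar_side == is_psd_2x2(S)


rng = random.Random(2)
agreements = []
for _ in range(500):
    # Random symmetric positive definite matrix C = G G' + 0.1 I.
    g = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    C = [[sum(g[i][k] * g[j][k] for k in range(2)) + (0.1 if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    m = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    c = rng.uniform(0.1, 3.0)
    if abs(quad_form(C, m) - c) > 1e-6:   # skip numerically ambiguous ties
        agreements.append(both_sides_agree(C, m, c))
print(all(agreements))
```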

B.2 Bounds for General Compact Sets of Arms: Proof of Theorem 4.1

The proof of Theorem 4.1 makes use of a number of auxiliary results. The first result provides a
motivation for the choice of the parameter $\alpha$ in Equation (5) and our definition of the uncertainty
radius $R_t^u$. They are chosen to keep the probability of overestimating the reward
of an arm by more than $R_t^u$ bounded by $1/t^2$. This will limit the growth rate of the cumulative
regret due to such overestimation.

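Lemma B.6 (Large Deviation Inequalities for the Uncertainty Radius). Under Assumption 1, for
any arm $u \in \mathcal{U}_r$ and $t \ge r$,

$$\Pr\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \mid Z = z \right\} \le \frac{1}{t^2},$$

and for any $x \in \mathbb{R}^r$,

$$\Pr\left\{ (U_{t+1} - x)'\big(\widehat{Z}_t - z\big) > \alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|U_{t+1} - x\|_{C_t} \mid Z = z \right\} \le \frac{1}{t^2},$$

where $\alpha = 4\sigma_0\kappa_0^2$.

Proof. It suffices to establish the first inequality because the proof of the second one is exactly
the same. Let $\beta_t = 4\sigma_0\kappa_0^2 \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}$. Recall that
$R_t^u = \beta_t \|u\|_{C_t}$. By applying Theorem B.1 (with $\zeta = 4\kappa_0^2 \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}$) and Theorem
B.2 (with $\zeta = 4\kappa_0 \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}$), we obtain

$$\Pr\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \mid Z = z \right\} \le \min\left\{ t^{5|\mathcal{U}_r|}\, e^{-8\kappa_0^4 (\log t) \min\{r \log t, |\mathcal{U}_r|\}},\; t^{r\kappa_0^2}\, e^{-4\kappa_0^2 \min\{r \log t, |\mathcal{U}_r|\}} \right\}.$$

There are two cases to consider: $r \log t \ge |\mathcal{U}_r|$ and $r \log t < |\mathcal{U}_r|$. Suppose that $r \log t \ge |\mathcal{U}_r|$. Then,

$$\Pr\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \mid Z = z \right\} \le t^{5|\mathcal{U}_r|}\, e^{-8\kappa_0^4 (\log t) \min\{r \log t, |\mathcal{U}_r|\}} = t^{5|\mathcal{U}_r|}\, e^{-8\kappa_0^4 (\log t) |\mathcal{U}_r|} = \frac{1}{t^{(8\kappa_0^4 - 5)|\mathcal{U}_r|}} \le \frac{1}{t^2},$$

where the last inequality follows from the fact that $(8\kappa_0^4 - 5)\,|\mathcal{U}_r| \ge 2$. In the second case where
$r \log t < |\mathcal{U}_r|$, we have that

$$\Pr\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \mid Z = z \right\} \le t^{r\kappa_0^2}\, e^{-4\kappa_0^2 \min\{r \log t, |\mathcal{U}_r|\}} = t^{r\kappa_0^2}\, e^{-4\kappa_0^2 r \log t} = \frac{1}{t^{3r\kappa_0^2}} \le \frac{1}{t^2},$$

where the last inequality follows from the fact that $3r\kappa_0^2 \ge 2$. Since the probability is bounded by
$1/t^2$ in both cases, this gives the desired result. □

For any $t \ge 1$, let the random variable $Q_t(z)$ denote the instantaneous regret in period $t$ given
that $Z = z$, that is,

$$Q_t(z) = \max_{v \in \mathcal{U}_r} v'z - U_t'z. \qquad (9)$$

Lemma B.6 shows that the probability of a large estimation error in period $t$ is at most $O(1/t^2)$.
Consequently, as shown in the following lemma, the probability of having a large instantaneous
regret in period $t$ is also small.

Lemma B.7 (Instantaneous Regret Bound). Under Assumption 1, for all $t \ge r$ and $z \in \mathbb{R}^r$,

$$\Pr\left\{ Q_{t+1}(z) > 2\alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|U_{t+1}\|_{C_t} \mid Z = z \right\} \le \frac{1}{t^2}.$$

Proof. Let $z \in \mathbb{R}^r$ be given and let $w$ denote an optimal arm, that is, $\max_{v \in \mathcal{U}_r} v'z = w'z$. To
facilitate our discussion, let $\beta_t = \alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}$. Then, it follows from the definition
of the uncertainty radius and the definition of $U_{t+1}$ in Equation (7) that

$$U_{t+1}'\widehat{Z}_t + \beta_t \|U_{t+1}\|_{C_t} \ge w'\widehat{Z}_t + \beta_t \|w\|_{C_t},$$

which implies that

$$\beta_t \|U_{t+1}\|_{C_t} \ge (w - U_{t+1})'\widehat{Z}_t + \beta_t \|w\|_{C_t}$$

$$= (w - U_{t+1})'z + (w - U_{t+1})'\big(\widehat{Z}_t - z\big) + \beta_t \|w\|_{C_t}$$

$$= Q_{t+1}(z) + (w - U_{t+1})'\big(\widehat{Z}_t - z\big) + \beta_t \|w\|_{C_t}.$$

Suppose that the event $Q_{t+1}(z) > 2\beta_t \|U_{t+1}\|_{C_t}$ occurs. Then, it follows that

$$\beta_t \|U_{t+1}\|_{C_t} > 2\beta_t \|U_{t+1}\|_{C_t} + (w - U_{t+1})'\big(\widehat{Z}_t - z\big) + \beta_t \|w\|_{C_t},$$

which implies that $(U_{t+1} - w)'\big(\widehat{Z}_t - z\big) > \beta_t \left( \|U_{t+1}\|_{C_t} + \|w\|_{C_t} \right) \ge \beta_t \|U_{t+1} - w\|_{C_t}$. Thus,

$$\Pr\left\{ Q_{t+1}(z) > 2\beta_t \|U_{t+1}\|_{C_t} \mid Z = z \right\} \le \Pr\left\{ (U_{t+1} - w)'\big(\widehat{Z}_t - z\big) > \beta_t \|U_{t+1} - w\|_{C_t} \mid Z = z \right\} \le \frac{1}{t^2},$$

where the last inequality follows from Lemma B.6. □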
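
The proof of Lemma B.7 relies on the arm $U_{t+1}$ maximizing the estimated reward plus the uncertainty radius. A minimal sketch of that selection rule is given below, assuming a finite list of candidate arms and a covariance-like matrix $C_t$ supplied as nested lists; the function names `weighted_norm` and `select_arm`, and the numerical inputs, are hypothetical and serve only to show how the index behaves.

```python
import math


def weighted_norm(u, C):
    """Return ||u||_C = sqrt(u' C u) for a square matrix C given as nested lists."""
    r = len(u)
    return math.sqrt(sum(u[i] * C[i][j] * u[j] for i in range(r) for j in range(r)))


def select_arm(arms, z_hat, C, alpha, t):
    """Pick the index of the arm maximizing u' z_hat + R_t^u, where
    R_t^u = alpha * sqrt(log t) * sqrt(min(r log t, |arms|)) * ||u||_C."""
    r = len(z_hat)
    beta_t = alpha * math.sqrt(math.log(t)) * math.sqrt(min(r * math.log(t), len(arms)))

    def index(u):
        estimate = sum(ui * zi for ui, zi in zip(u, z_hat))
        return estimate + beta_t * weighted_norm(u, C)

    return max(range(len(arms)), key=lambda k: index(arms[k]))


arms = [(1.0, 0.0), (0.0, 1.0)]
z_hat = (1.0, 0.0)                               # current least squares estimate
well_explored = [[1e-6, 0.0], [0.0, 1e-6]]       # both directions sampled often
unexplored_2nd = [[1e-6, 0.0], [0.0, 100.0]]     # second direction barely sampled
first_choice = select_arm(arms, z_hat, well_explored, alpha=1.0, t=3)
second_choice = select_arm(arms, z_hat, unexplored_2nd, alpha=1.0, t=3)
print(first_choice, second_choice)
```

When both directions are well explored the rule picks the arm with the higher estimated reward; when the second direction has a large weighted norm, the uncertainty radius dominates and the rule explores it instead.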

Lemma B.7 suggests the following approach for bounding the cumulative regret over $T$ periods.
In the first $r$ periods (during the initialization), we incur a regret of $O(r)$. For each time
period between $r + 1$ and $T$, we consider two cases: 1) the instantaneous regret is large,
with $Q_{t+1}(z) > 2\alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|U_{t+1}\|_{C_t}$; and 2) the instantaneous regret is small.
By the above lemma, the contribution to the cumulative regret from the first case is bounded
above by $O\left(\sum_t 1/t^2\right)$, which is finite. In the second case, we have a simple upper bound of
$2\alpha \sqrt{r}\, (\log t)\, \|U_{t+1}\|_{C_t}$ for the instantaneous regret. This argument leads to the following bound
on the cumulative regret over $T$ periods.

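Lemma B.8 (Regret Decomposition). Under Assumption 1, for all $T \ge r + 1$ and $z \in \mathbb{R}^r$,

$$\mathrm{Regret}(z, T, \mathrm{UE}) \le 2\bar{u}(r + 2)\, \|z\| + 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\; E\left[ \sqrt{\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2} \;\Big|\; Z = z \right].$$

Proof. Let $z \in \mathbb{R}^r$ be given. By the Cauchy-Schwarz Inequality and Assumption 1(b), we have the
following upper bound on the instantaneous regret for all $t$ and $z \in \mathbb{R}^r$: $Q_t(z) = \max_{v \in \mathcal{U}_r} (v - U_t)'z \le 2\bar{u}\, \|z\|$. Therefore,

$$\mathrm{Regret}(z, T, \mathrm{UE}) \le 2\bar{u} r\, \|z\| + E\left[ \sum_{t=r}^{T-1} Q_{t+1}(z) \;\Big|\; Z = z \right].$$

For any $t \ge r$, let the indicator random variable $G_{t+1}(z)$ be defined by:

$$G_{t+1}(z) = \mathbf{1}\left\{ Q_{t+1}(z) \le 2\alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|U_{t+1}\|_{C_t} \right\}.$$

The contribution to the expected instantaneous regret $E[Q_{t+1}(z) \mid Z = z]$ comes from two cases:
1) when $G_{t+1}(z) = 0$ and 2) when $G_{t+1}(z) = 1$. We will upper bound each of these two contributions
separately. In the first case, we know from Lemma B.7 that $\Pr\{G_{t+1}(z) = 0 \mid Z = z\} = \Pr\left\{ Q_{t+1}(z) > 2\alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|U_{t+1}\|_{C_t} \mid Z = z \right\} \le 1/t^2$. Since $\sum_{t=1}^{\infty} 1/t^2 \le 2$, we
have that

$$E\left[ \sum_{t=r}^{T-1} \left(1 - G_{t+1}(z)\right) Q_{t+1}(z) \;\Big|\; Z = z \right] \le 2\bar{u}\, \|z\| \sum_{t=r}^{T-1} \Pr\left\{ G_{t+1}(z) = 0 \mid Z = z \right\} \le 4\bar{u}\, \|z\|.$$

On the other hand, when $G_{t+1}(z) = 1$, we have that $Q_{t+1}(z) \le 2\alpha \sqrt{r}\, (\log t)\, \|U_{t+1}\|_{C_t}$. This implies
that, with probability one,

$$\sum_{t=r}^{T-1} G_{t+1}(z)\, Q_{t+1}(z) \le 2\alpha \sqrt{r} \sum_{t=r}^{T-1} (\log t)\, \|U_{t+1}\|_{C_t} \le 2\alpha \sqrt{r}\, \sqrt{\sum_{t=r}^{T-1} \log^2 t}\; \sqrt{\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2} \le 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\, \sqrt{\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2},$$

where we use the Cauchy-Schwarz Inequality in the second inequality and the final inequality
follows from the fact that $\sqrt{\sum_{t=r}^{T-1} \log^2 t} \le \sqrt{T \log^2 T} = (\log T)\sqrt{T}$. Putting the two cases
together gives the desired upper bound because

$$E\left[ \sum_{t=r}^{T-1} Q_{t+1}(z) \;\Big|\; Z = z \right] \le 4\bar{u}\, \|z\| + 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\; E\left[ \sqrt{\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2} \;\Big|\; Z = z \right].$$

□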

The eigenvectors of the matrix $C_t = \left( \sum_{s=1}^{t} U_s U_s' \right)^{-1}$ reflect the directions of the arms that are
chosen during the first $t$ periods. The corresponding eigenvalues then measure the frequency with
which these directions are explored. Frequently explored directions will have small eigenvalues,
while the eigenvalues for unexplored directions will be large. Thus, the weighted norm $\|U_{t+1}\|_{C_t}$
has two interpretations. First, it measures the size of the regret in period $t + 1$. In addition, since
$\|U_{t+1}\|_{C_t}^2$ is a linear combination of the eigenvalues of $C_t$, it also reflects the amount of exploration
in period $t + 1$ in the unexplored directions.
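
A small worked example may help fix this interpretation. It assumes the simplest possible setting, in which every chosen arm is a standard basis vector, so that $C_t$ is diagonal with entries equal to the reciprocals of the pull counts; the function name and the counts are hypothetical.

```python
import math


def weighted_norm_diag(u, pulls):
    """||u||_{C_t} when the chosen arms are standard basis vectors and pulls[i]
    counts how often direction i was chosen, so that C_t = diag(1 / pulls[i])."""
    return math.sqrt(sum(ui * ui / n for ui, n in zip(u, pulls)))


pulls = [400, 4]                          # direction 0 explored often, direction 1 rarely
eigenvalues = [1.0 / n for n in pulls]    # eigenvalues of the diagonal matrix C_t
norm_explored = weighted_norm_diag((1.0, 0.0), pulls)
norm_unexplored = weighted_norm_diag((0.0, 1.0), pulls)
print(eigenvalues, norm_explored, norm_unexplored)
```

The frequently explored direction has the smaller eigenvalue and the smaller weighted norm, matching the description above.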

The above interpretation suggests that if we incurred large regrets in the past (equivalently, we have
done a lot of exploration), then the current regret should be small. Our intuition is confirmed in
the following lemma that establishes a recursive relationship between the weighted norm $\|U_{t+1}\|_{C_t}$
in period $t + 1$ and the norms in the preceding periods.

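Lemma B.9 (Large Past Regrets Imply Small Current Regret). Under Assumption 1, for all $t \ge r$
and $z \in \mathbb{R}^r$, with probability one,

$$0 \le \|U_{t+1}\|_{C_t}^2 \le \frac{\bar{u}^2}{\lambda_0} \qquad \text{and} \qquad \|U_{t+1}\|_{C_t}^2 \le \frac{\left\{ (\bar{u}^2/\lambda_0) \cdot (t + 1) \right\}^r}{\prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right)}.$$

Proof. For any $t \ge r$, let $\Gamma_t = (C_t)^{-1} = \sum_{s=1}^{t} U_s U_s'$. By the Rayleigh Principle,

$$\|U_{t+1}\|_{C_t}^2 = U_{t+1}' C_t U_{t+1} \le \lambda_{\max}(C_t)\, \|U_{t+1}\|^2 = \frac{\|U_{t+1}\|^2}{\lambda_{\min}(\Gamma_t)} \le \frac{\bar{u}^2}{\lambda_0},$$

where the last inequality follows from the definition of $\bar{u}$ and the fact that

$$\lambda_{\min}(\Gamma_t) = \lambda_{\min}\left( \Gamma_r + \sum_{s=r+1}^{t} U_s U_s' \right) \ge \lambda_{\min}(\Gamma_r) \ge \lambda_0,$$

where the last inequality follows from the fact that $\Gamma_r = \sum_{k=1}^{r} b_k b_k'$, where the vectors $b_1, \ldots, b_r$
are given in Assumption 1(b). This proves the claimed upper bound on $\|U_{t+1}\|_{C_t}^2$.

We will now establish the inequality that relates $\|U_{t+1}\|_{C_t}^2$ to $\|U_{s+1}\|_{C_s}^2$ for $s < t$. Note that

$$\|U_{t+1}\|_{C_t}^2 = U_{t+1}' C_t U_{t+1} \le 1 + U_{t+1}' C_t U_{t+1} = \frac{\det(\Gamma_t) \cdot \left( 1 + U_{t+1}' C_t U_{t+1} \right)}{\det(\Gamma_t)} = \frac{\det\left( \Gamma_t + U_{t+1} U_{t+1}' \right)}{\det(\Gamma_t)} = \frac{\det(\Gamma_{t+1})}{\det(\Gamma_t)}, \qquad (10)$$

where the second to last equality follows from the matrix determinant lemma.

We will now establish bounds on the determinants $\det(\Gamma_{t+1})$ and $\det(\Gamma_t)$. Note that

$$\lambda_{\max}(\Gamma_{t+1}) \le \mathrm{tr}(\Gamma_{t+1}) = \sum_{s=1}^{t+1} \mathrm{tr}\left( U_s U_s' \right) = \sum_{s=1}^{t+1} \|U_s\|^2 \le (t + 1)\, \bar{u}^2,$$

where the last inequality follows from the definition of $\bar{u}$. Therefore, $\det(\Gamma_{t+1}) \le \left[ \lambda_{\max}(\Gamma_{t+1}) \right]^r \le (t + 1)^r\, \bar{u}^{2r}$. Moreover, using Equation (10) repeatedly, we obtain

$$\det(\Gamma_t) = \det(\Gamma_r) \prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right) \ge \lambda_0^r \prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right),$$

where the last inequality follows from the fact that $\Gamma_r = \sum_{k=1}^{r} b_k b_k'$ and $\det(\Gamma_r) \ge \left[ \lambda_{\min}(\Gamma_r) \right]^r \ge \lambda_0^r$,
where the vectors $b_1, \ldots, b_r$ and the parameter $\lambda_0$ are defined in Assumption 1(b).
Putting everything together, we have that

$$\|U_{t+1}\|_{C_t}^2 \le \frac{\det(\Gamma_{t+1})}{\det(\Gamma_t)} \le \frac{(t + 1)^r\, \bar{u}^{2r}}{\lambda_0^r \prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right)} = \frac{\left\{ (\bar{u}^2/\lambda_0) \cdot (t + 1) \right\}^r}{\prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right)},$$

which is the desired result. □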
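
The two bounds of Lemma B.9 can be checked along a simulated path. The sketch below assumes $r = 2$, takes $b_1, b_2$ to be the standard basis vectors (so that $\Gamma_r = I$, $\lambda_0 = 1$) and draws all later arms as random unit vectors (so that $\bar{u} = 1$). These choices, and the function names, are assumptions made for illustration and are not part of the policy.

```python
import math
import random


def inverse_2x2(G):
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]


def check_lemma_b9(num_periods=50, seed=3):
    """Simulate unit-norm arms in R^2 and check both bounds of Lemma B.9 each period.

    Assumptions for this example: Gamma_r = I (standard basis vectors as b_1, b_2),
    lambda_0 = 1 and u_bar = 1; arms after period r are random unit vectors.
    """
    rng = random.Random(seed)
    r, lam0, u_bar = 2, 1.0, 1.0
    gamma = [[1.0, 0.0], [0.0, 1.0]]     # Gamma_r = b_1 b_1' + b_2 b_2'
    product = 1.0                         # prod_{s=r}^{t-1} (1 + ||U_{s+1}||^2_{C_s})
    ok = True
    for t in range(r, r + num_periods):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        u = (math.cos(angle), math.sin(angle))
        C = inverse_2x2(gamma)
        norm_sq = sum(u[i] * C[i][j] * u[j] for i in range(2) for j in range(2))
        bound = ((u_bar ** 2 / lam0) * (t + 1)) ** r / product
        ok = ok and norm_sq <= u_bar ** 2 / lam0 + 1e-12 and norm_sq <= bound + 1e-12
        # Rank-one update Gamma_{t+1} = Gamma_t + u u', then extend the product.
        gamma = [[gamma[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]
        product *= 1.0 + norm_sq
    return ok


result = check_lemma_b9()
print(result)
```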

The above result shows that if the weighted norms in the preceding periods, as measured by
$\prod_{s=r}^{t-1} \left( 1 + \|U_{s+1}\|_{C_s}^2 \right)$, are large, then the weighted norm in the current period $t + 1$ will be small.
Moreover, since the weighted norm in the current period depends on the product of the norms in
the past, we hope that the growth rate of the sum $\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2$ should be small. To formalize
our conjecture, we introduce a related optimization problem. For any $c \ge 0$ and $t \ge 1$, let $V^*(c, t)$
be defined by:

$$V^*(c, t) = \max \sum_{s=1}^{t} y_s$$

$$\text{s.t.} \quad 0 \le y_s \le c \quad \text{and} \quad y_s \le \frac{\{c \cdot (r + s)\}^r}{\prod_{q=1}^{s-1} (1 + y_q)}, \qquad s = 1, 2, \ldots, t,$$

where we define $\prod_{q=1}^{0} (1 + y_q) = 1$. The following lemma gives an upper bound in terms of the
function $V^*$.

Lemma B.10 (Bounds on the Growth Rate of $\|U_{t+1}\|_{C_t}$). Under Assumption 1, for any $T \ge r + 1$
and $z \in \mathbb{R}^r$, with probability one, $\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2 \le V^*\left( \bar{u}^2/\lambda_0,\, T - r \right)$.

Proof. For all $s \ge 1$, let $y_s = \|U_{r+s}\|_{C_{r+s-1}}^2$. Then, $\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2 = \sum_{s=1}^{T-r} y_s$. Let $c_0 = \bar{u}^2/\lambda_0$. It follows from Lemma B.9 that for all $s$, with probability one, $0 \le y_s \le c_0$ and $y_s \le \{c_0 (r + s)\}^r / \prod_{q=1}^{s-1} (1 + y_q)$. Therefore, we have $\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2 \le V^*(c_0, T - r)$. □

It follows from Lemma B.10 that it suffices to develop an upper bound on $V^*(c, t)$. This result
is given in the following lemma.

Lemma B.11 (Optimization Bound). For all $c > 0$ and $t \ge 1$,

$$V^*(c, t) \le 2c_0 \left( r \log c_0 + (r + 1) \log(r + t + 1) \right),$$

where $c_0 = \max\{1, c\}$.

Proof. Any feasible solution $\{y_s : s = 1, \ldots, t\}$ for the problem defining $V^*(c, t)$ also satisfies the
constraints

$$0 \le \frac{y_s}{2c_0} \le \frac{y_s}{c_0} \le 1 \qquad \text{and} \qquad \frac{y_s}{2c_0} \le \frac{\{c_0 \cdot (r + s)\}^r}{\prod_{q=1}^{s-1} \left( 1 + (y_q/c_0) \right)} \le \{c_0 \cdot (r + s)\}^r\, e^{-\sum_{q=1}^{s-1} y_q/(2c_0)},$$

where the last inequality follows from the fact that for any $a \in [0, 1]$, we have $1 + a \ge e^{a/2}$. Thus,
by letting $a_s = y_s/(2c_0)$, we obtain $V^*(c, t) \le 2c_0 W^*(c_0, t)$, where $W^*(c_0, t)$ is the maximum possible
value of $\sum_{s=1}^{t} a_s$, subject to

$$0 \le a_s \le 1 \qquad \text{and} \qquad a_s \le \{c_0 \cdot (r + s)\}^r\, e^{-\sum_{q=1}^{s-1} a_q}.$$

Let us introduce a continuous-time variable $\tau$, and define $a(\tau) = a_s$, for $\tau \in [s - 1, s)$. Let
$b(\tau) = \int_0^{\tau} a(\tau')\, d\tau'$, and note that $b(s) = \sum_{q=1}^{s} a_q$. For any $\tau \in [s - 1, s)$, we have

$$\dot{b}(\tau) = a_s \le \{c_0 \cdot (r + s)\}^r\, e^{-\sum_{q=1}^{s-1} a_q} \le \{c_0 \cdot (r + \tau + 1)\}^r\, e^{-\sum_{q=1}^{s} a_q + 1} \le \{c_0 \cdot (r + \tau + 1)\}^r\, e^{-b(\tau) + 1}.$$

Let $d(\tau) = e^{b(\tau)}$. Then, for any $\tau \ge 0$,

$$\dot{d}(\tau) = d(\tau)\, \dot{b}(\tau) \le e^{b(\tau)}\, \{c_0 \cdot (r + \tau + 1)\}^r\, e^{-b(\tau) + 1} = \{c_0 \cdot (r + \tau + 1)\}^r\, e.$$

By integrating both sides, we obtain $d(t) \le e\, c_0^r\, \frac{(r + t + 1)^{r+1}}{r + 1}$ for all $t \ge 0$. Since $e/(r + 1) \le 1$ because
$r \ge 2$, taking logarithms, we obtain

$$\sum_{s=1}^{t} a_s = b(t) = \log d(t) \le r \log c_0 + (r + 1) \log(r + t + 1).$$

The right-hand side above is therefore an upper bound on $W^*(c_0, t)$, which leads to the upper
bound on $V^*(c, t)$, which gives the desired result. □
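
To get a feel for the size of $V^*(c, t)$, one can evaluate a particular feasible point of the optimization problem and compare it with the bound of Lemma B.11. The sketch below uses a greedy choice, $y_s = \min\{c, \{c(r+s)\}^r / \prod_{q<s}(1 + y_q)\}$, which is feasible by construction but is not claimed to be optimal; the function names and the parameter triples are assumptions made for this example.

```python
import math


def greedy_feasible_value(c, r, t):
    """Objective value of the feasible point
    y_s = min(c, {c (r + s)}^r / prod_{q < s} (1 + y_q)), s = 1, ..., t."""
    total, product = 0.0, 1.0
    for s in range(1, t + 1):
        y = min(c, (c * (r + s)) ** r / product)
        total += y
        product *= 1.0 + y
    return total


def lemma_b11_bound(c, r, t):
    """Upper bound 2 c0 (r log c0 + (r + 1) log(r + t + 1)) with c0 = max(1, c)."""
    c0 = max(1.0, c)
    return 2.0 * c0 * (r * math.log(c0) + (r + 1) * math.log(r + t + 1))


cases = [(1.0, 2, 100), (0.5, 3, 1000), (4.0, 2, 500)]   # illustrative (c, r, t)
results = [(greedy_feasible_value(c, r, t), lemma_b11_bound(c, r, t)) for c, r, t in cases]
for (c, r, t), (value, bound) in zip(cases, results):
    print(c, r, t, value, bound)
```

Because the greedy point is feasible, its objective value is a lower bound on $V^*(c, t)$ and must therefore sit below the logarithmic upper bound of the lemma.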

Finally, here is the proof of Theorem 4.1.

Proof. It suffices to establish the regret bound because the risk bound follows immediately from
taking the expectation. Let $A_0 = \max\{1, \bar{u}^2/\lambda_0\}$. It follows from Lemmas B.8, B.10, and B.11 that

$$\mathrm{Regret}(z, T, \mathrm{UE}) \le 2\bar{u}(r + 2)\, \|z\| + 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\; E\left[ \sqrt{\sum_{t=r}^{T-1} \|U_{t+1}\|_{C_t}^2} \;\Big|\; Z = z \right]$$

$$\le 2\bar{u}(r + 2)\, \|z\| + 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\, \sqrt{V^*\left( \bar{u}^2/\lambda_0,\, T - r \right)}$$

$$\le 2\bar{u}(r + 2)\, \|z\| + 2\alpha \sqrt{r}\, (\log T) \sqrt{T}\, \sqrt{2A_0 \left\{ r \log A_0 + (r + 1) \log(T + 1) \right\}}$$

$$\le a_4\, r\, \|z\| + a_5\, r \sqrt{T} \log^{3/2} T,$$

for some positive constants $a_4$ and $a_5$ that depend only on $\sigma_0$, $\bar{u}$, and $\lambda_0$.

□

B.3 Bounds for Finitely Many Arms: Proof of Theorem 4.2

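Recall that for any $z \in \mathbb{R}^r$ and $u \in \mathcal{U}_r$, $N^u(z, T)$ is the number of times that arm $u$ has been
chosen during the first $T$ periods. To complete the proof of Theorem 4.2, it suffices to show that

$$E\left[ N^u(z, T) \mid Z = z \right] \le 6 + \frac{4\alpha^2\, |\mathcal{U}_r| \log T}{\left( \Delta^u(z) \right)^2}.$$

Let us fix some $z \in \mathbb{R}^r$ and $u \in \mathcal{U}_r$. Since $N^u(z, t)$ is nondecreasing in $t$, we can show that for any
positive integer $\theta$, $N^u(z, T) \le \theta + \sum_{t=r}^{T-1} \mathbf{1}\{U_{t+1} = u \text{ and } N^u(z, t) \ge \theta\}$. Suppose that $w$ is the optimal
arm, that is, $\max_{v \in \mathcal{U}_r} v'z = w'z$. Then, we have that

$$\mathbf{1}\{U_{t+1} = u\} \le \mathbf{1}\left\{ u'\widehat{Z}_t + R_t^u \ge w'\widehat{Z}_t + R_t^w \right\} \le \mathbf{1}\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \right\} + \mathbf{1}\left\{ w'\big(\widehat{Z}_t - z\big) < -R_t^w \right\} + \mathbf{1}\left\{ (w - u)'z \le 2R_t^u \right\}.$$

Since $\Pr\left\{ u'\big(\widehat{Z}_t - z\big) > R_t^u \mid Z = z \right\}$ and $\Pr\left\{ w'\big(\widehat{Z}_t - z\big) < -R_t^w \mid Z = z \right\}$ are bounded above by
$1/t^2$ by Lemma B.6, we can show that

$$E\left[ N^u(z, T) \mid Z = z \right] \le \theta + 2\sum_{t=1}^{\infty} \frac{1}{t^2} + \sum_{t=r}^{T-1} \Pr\left\{ (w - u)'z \le 2R_t^u \text{ and } N^u(z, t) \ge \theta \mid Z = z \right\}$$

$$\le 4 + \theta + \sum_{t=r}^{T-1} \Pr\left\{ (w - u)'z \le 2R_t^u \text{ and } N^u(z, t) \ge \theta \mid Z = z \right\}.$$

Let $H = \sum_{v \in \mathcal{U}_r : v \ne u} N^v(z, t)\, vv'$. It follows from the definition of $C_t$ and the Sherman-Morrison
Formula (see Sherman and Morrison, 1950) that

$$C_t = \left( H + N^u(z, t)\, uu' \right)^{-1} = H^{-1} - \frac{N^u(z, t)\, H^{-1} uu' H^{-1}}{1 + N^u(z, t)\, u'H^{-1}u},$$

which implies that

$$\|u\|_{C_t}^2 = u'C_t u = u'H^{-1}u - \frac{N^u(z, t) \left( u'H^{-1}u \right)^2}{1 + N^u(z, t)\, u'H^{-1}u} = \frac{u'H^{-1}u}{1 + N^u(z, t)\, u'H^{-1}u} \le \frac{1}{N^u(z, t)},$$

and therefore, $2R_t^u = 2\alpha \sqrt{\log t}\, \sqrt{\min\{r \log t, |\mathcal{U}_r|\}}\; \|u\|_{C_t} \le \left( 2\alpha \sqrt{|\mathcal{U}_r| \log t} \right) / \sqrt{N^u(z, t)}$.

By setting $\theta = 1 + \left\lceil \frac{4\alpha^2\, |\mathcal{U}_r| \log T}{(\Delta^u(z))^2} \right\rceil$, we conclude that $2R_t^u < \Delta^u(z) = (w - u)'z$ whenever
$N^u(z, t) \ge \theta$. This implies that $\mathbf{1}\left\{ (w - u)'z \le 2R_t^u \text{ and } N^u(z, t) \ge \theta \right\} = 0$, and we have that

$$E\left[ N^u(z, T) \mid Z = z \right] \le 4 + 1 + \left\lceil \frac{4\alpha^2\, |\mathcal{U}_r| \log T}{(\Delta^u(z))^2} \right\rceil \le 6 + \frac{4\alpha^2\, |\mathcal{U}_r| \log T}{(\Delta^u(z))^2},$$

which is the desired result.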
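
The Sherman-Morrison step, and the resulting bound $\|u\|_{C_t}^2 \le 1/N^u(z,t)$, can be verified on a small instance. The sketch below assumes $r = 2$ and a hypothetical set of three arms with made-up pull counts; it computes $u'C_t u$ once by directly inverting $\sum_v N^v\, vv'$ and once through the closed form $u'H^{-1}u / (1 + N^u\, u'H^{-1}u)$.

```python
def inverse_2x2(G):
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]


def quad(u, A):
    """Return u' A u for a 2-vector u and a 2x2 matrix A."""
    return sum(u[i] * A[i][j] * u[j] for i in range(2) for j in range(2))


def gram(arms, counts, skip=None):
    """Return sum_v counts[v] * v v', optionally leaving out the arm with index `skip`."""
    G = [[0.0, 0.0], [0.0, 0.0]]
    for idx, (v, n) in enumerate(zip(arms, counts)):
        if idx == skip:
            continue
        for i in range(2):
            for j in range(2):
                G[i][j] += n * v[i] * v[j]
    return G


def norm_sq_direct_and_formula(arms, counts, k):
    """Return (u' C_t u by direct inversion, u' H^{-1} u / (1 + N u' H^{-1} u)) for u = arms[k]."""
    u, n_u = arms[k], counts[k]
    direct = quad(u, inverse_2x2(gram(arms, counts)))
    h = quad(u, inverse_2x2(gram(arms, counts, skip=k)))
    return direct, h / (1.0 + n_u * h)


# Hypothetical arms and pull counts chosen for illustration.
arms = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
counts = [5, 3, 7]
direct, formula = norm_sq_direct_and_formula(arms, counts, k=2)
print(direct, formula, 1.0 / counts[2])
```

The two computations agree up to rounding, and both fall below $1/N^u$, which is the inequality that drives the choice of $\theta$ in the proof.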
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